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Introduction 


he term machine learning has all sorts of meanings attached to it today, 

especially after Hollywood’s (and others’) movie studios have gotten into 

the picture. Films such as Ex Machina have tantalized the imaginations of 
moviegoers the world over and made machine learning into all sorts of things that 
it really isn’t. Of course, most of us have to live in the real world, where machine 
learning actually does perform an incredible array of tasks that have nothing to do 
with androids that can pass the Turing Test (fooling their makers into believing 
they’re human). Machine Learning For Dummies provides you with a view of machine 
learning in the real world and exposes you to the amazing feats you really can 
perform using this technology. Even though the tasks that you perform using 
machine learning may seem a bit mundane when compared to the movie version, 
by the time you finish this book, you realize that these mundane tasks have the 
power to impact the lives of everyone on the planet in nearly every aspect of their 
daily lives. In short, machine learning is an incredible technology — just not in 
the way that some people have imagined. 


About This Book 


The main purpose of Machine Learning For Dummies is to help you understand what 
machine learning can and can’t do for you today and what it might do for you in 
the future. You don’t have to be a computer scientist to use this book, even though 
it does contain many coding examples. In fact, you can come from any discipline 
that heavily emphasizes math because that’s how this book focuses on machine 
learning. Instead of dealing with abstractions, you see the concrete results of 
using specific algorithms to interact with big data in particular ways to obtain a 
certain, useful result. The emphasis is on useful because machine learning has the 
power to perform a wide array of tasks in a manner never seen before. 


Part of the emphasis of this book is on using the right tools. This book uses both 
Python and R to perform various tasks. These two languages have special features 
that make them particularly useful in a machine learning setting. For example, 
Python provides access to a huge array of libraries that let you do just about any- 
thing you can imagine and more than a few you can’t. Likewise, R provides an ease 
of use that few languages can match. Machine Learning For Dummies helps you under- 
stand that both languages have their role to play and gives examples of when one 
language works a bit better than the other to achieve the goals you have in mind. 
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You also discover some interesting techniques in this book. The most important is 
that you don’t just see the algorithms used to perform tasks; you also get an 
explanation of how the algorithms work. Unlike many other books, Machine Learn- 
ing For Dummies enables you to fully understand what you’re doing, but without 
requiring you to have a PhD in math. After you read this book, you finally have a 
basis on which to build your knowledge and go even further in using machine 
learning to perform tasks in your specific field. 


Of course, you might still be worried about the whole programming environment 
issue, and this book doesn’t leave you in the dark there, either. At the beginning, 
you find complete installation instructions for both RStudio and Anaconda, which 
are the Integrated Development Environments (IDEs) used for this book. In addi- 
tion, quick primers (with references) help you understand the basic R and Python 
programming that you need to perform. The emphasis is on getting you up and 
running as quickly as possible, and to make examples straightforward and simple 
so that the code doesn’t become a stumbling block to learning. 


To help you absorb the concepts, this book uses the following conventions: 


>> Text that you're meant to type just as it appears in the book is in bold. The 
exception is when you're working through a step list: Because each step is 
bold, the text to type is not bold. 


>> Words that we want you to type in that are also in italics are used as place- 
holders, which means that you need to replace them with something that 
works for you. For example, if you see “Type Your Name and press Enter,” you 
need to replace Your Name with your actual name. 


>> We also use italics for terms we define. This means that you don't have to rely 
on other sources to provide the definitions you need. 


>> Web addresses and programming code appear in monofont. If you're reading 
a digital version of this book on a device connected to the Internet, you can 
click the live link to visit that website, like this: http: //www. dummies. com. 


>> When you need to click command sequences, you see them separated by a 
special arrow, like this: File ®© New File, which tells you to click File and then 
New File. 


Foolish Assumptions 


You might find it difficult to believe that we’ve assumed anything about you — 
after all, we haven’t even met you yet! Although most assumptions are indeed 
foolish, we made certain assumptions to provide a starting point for the book. 
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The first assumption is that you’re familiar with the platform you want to use 
because the book doesn’t provide any guidance in this regard. (Chapter 4 does, 
however, provide RStudio installation instructions, and Chapter 6 tells you how to 
install Anaconda.) To give you the maximum information about R and Python with 
regard to machine learning, this book doesn’t discuss any platform-specific issues. 
You really do need to know how to install applications, use applications, and gen- 
erally work with your chosen platform before you begin working with this book. 


This book isn’t a math primer. Yes, you see lots of examples of complex math, but 
the emphasis is on helping you use R, Python, and machine learning to perform 
analysis tasks rather than learn math theory. However, you do get explanations of 
many of the algorithms used in the book so that you can understand how the algo- 
rithms work. Chapters 1 and 2 guide you through a better understanding of pre- 
cisely what you need to know in order to use this book successfully. 


This book also assumes that you can access items on the Internet. Sprinkled 
throughout are numerous references to online material that will enhance your 
learning experience. However, these added sources are useful only if you actually 
find and use them. 


Icons Used in This Book 


TIP 


WARNING 


ce 


TECHNICAL 
STUFF 





As you read this book, you encounter icons in the margins that indicate material 
of interest (or not, as the case may be). Here’s what the icons mean: 


Tips are nice because they help you save time or perform some task without a lot 
of extra work. The tips in this book are time-saving techniques or pointers to 
resources that you should try so that you can get the maximum benefit from R or 
Python, or in performing machine learning-related tasks. 


We don’t want to sound like angry parents or some kind of maniacs, but you 
should avoid doing anything that’s marked with a Warning icon. Otherwise, you 
might find that your application fails to work as expected, you get incorrect 
answers from seemingly bulletproof equations, or (in the worst-case scenario) 
you lose data. 


Whenever you see this icon, think advanced tip or technique. You might find these 
tidbits of useful information just too boring for words, or they could contain the 
solution you need to get a program running. Skip these bits of information when- 
ever you like. 
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REMEMBER 


ON THE WEB 


If you don’t get anything else out of a particular chapter or section, remember the 
material marked by this icon. This text usually contains an essential process or a 
bit of information that you must know to work with R or Python, or to perform 
machine learning-related tasks successfully. 


RStudio and Anaconda come equipped to perform a wide range of general tasks. 
However, machine learning also requires that you perform some specific tasks, 
which means downloading additional support from the web. This icon indicates 
that the following text contains a reference to an online source that you need to 
know about, and that you need to pay particular attention to so that you install 
everything needed to make the examples work. 


Beyond the Book 


4 


This book isn’t the end of your R, Python, or machine learning experience — it’s 
really just the beginning. We provide online content to make this book more flex- 
ible and better able to meet your needs. That way, as we receive email from you, 
we can address questions and tell you how updates to R, Python, or their associ- 
ated add-ons affect book content. In fact, you gain access to all these cool 
additions: 


>> Cheat sheet: You remember using crib notes in school to make a better mark 
on a test, don't you? You do? Well, a cheat sheet is sort of like that. It provides 
you with some special notes about tasks that you can do with R, Python, 
RStudio, Anaconda, and machine learning that not every other person knows. 
To view this book's Cheat Sheet, simply go to www. dummies . com and search 
for “Machine Learning For Dummies Cheat Sheet” in the Search box. It 
contains really neat information such as finding the algorithms you commonly 
need for machine learning. 


>> Updates: Sometimes changes happen. For example, we might not have seen 
an upcoming change when we looked into our crystal ball during the writing 
of this book. In the past, this possibility simply meant that the book became 
outdated and less useful, but you can now find updates to the book at 
http://www. dummies .com/extras/machinelearning. 


In addition to these updates, check out the blog posts with answers to reader 
questions and demonstrations of useful book-related techniques at http: // 
blog. johnmuel lerbooks.com/. 


>> Companion files: Hey! Who really wants to type all the code in the book and 
reconstruct all those plots manually? Most readers prefer to spend their time 
actually working with R, Python, performing machine learning tasks, and 
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seeing the interesting things they can do, rather than typing. Fortunately for 
you, the examples used in the book are available for download, so all you 
need to do is read the book to learn machine learning usage techniques. You 
can find these files at http: //www. dummies. com/extras/machinelearning. 


Where to Go from Here 


It’s time to start your machine learning adventure! If you’re completely new to 
machine learning tasks, you should start with Chapter 1 and progress through the 
book at a pace that allows you to absorb as much of the material as possible. Make 
sure to read about both R and Python because the book uses both languages as 
needed for the examples. 


If you’re a novice who’s in an absolute rush to get going with machine learning as 
quickly as possible, you can skip to Chapter 4 with the understanding that you 
may find some topics a bit confusing later. If you already have RStudio installed, 
you can skim Chapter 4. Likewise, if you already have Anaconda installed, you can 
skim Chapter 6. To use this book, you must install R version 3.2.3. The Python 
version we use is 2.7.11. The examples won’t work with the 3.x version of Python 
because this version doesn’t support some of the libraries we use. 


Readers who have some exposure to both R and Python, and have the appropriate 
language versions installed, can save reading time by moving directly to Chapter 8. 
You can always go back to earlier chapters as necessary when you have questions. 
However, you do need to understand how each technique works before moving to 
the next one. Every technique, coding example, and procedure has important 
lessons for you, and you could miss vital content if you start skipping too much 
information. 


Introduction 5 


Introducing How 
Machines Learn 


IN THIS PART... 





-= Discovering how Al really works and what it can do 
for you 


Considering what the term big data means 
Understanding the role of statistics in machine learning 
Defining where machine learning will take society in 
the future 


IN THIS CHAPTER 


Getting beyond the hype of artificial 
intelligence (Al) 


Defining the dream of Al 


Differentiating between the real 
world and fantasy 


Comparing Al to machine learning 


Understanding the engineering 
portion of Al and machine learning 


Delineating where engineering ends 
and art begins 


Getting the Real 
Story about Al 


rtificial Intelligence (AI) is a huge topic today, and it’s getting bigger all 

the time thanks to the success of technologies such as Siri (http://www. 

apple.com/ios/siri/). Talking to your smartphone is both fun and help- 
ful to find out things like the location of the best sushi restaurant in town or to 
discover how to get to the concert hall. As you talk to your smartphone, it learns 
more about the way you talk and makes fewer mistakes in understanding your 
requests. The capability of your smartphone to learn and interpret your particular 
way of speaking is an example of an AI, and part of the technology used to make 
it happen is machine learning. You likely make limited use of machine learning 
and AI all over the place today without really thinking about it. For example, the 
capability to speak to devices and have them actually do what you intend is an 
example of machine learning at work. Likewise, recommender systems, such as 
those found on Amazon, help you make purchases based on criteria such as 
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previous product purchases or products that complement a current choice. The 
use of both AI and machine learning will only increase with time. 


In this chapter, you delve into AI and discover what it means from several per- 
spectives, including how it affects you as a consumer and as a scientist or engi- 
neer. You also discover that AI doesn’t equal machine learning, even though the 
media often confuse the two. Machine learning is definitely different from AI, 
even though the two are related. 


Moving beyond the Hype 


10 





REMEMBER 


As any technology becomes bigger, so does the hype, and AI certainly has a lot of 
hype surrounding it. For one thing, some people have decided to engage in fear 
mongering rather than science. Killer robots, such as those found in the film The 
Terminator, really aren’t going to be the next big thing. Your first real experience 
with an android AI is more likely to be in the form a health care assistant (http: // 
magazine.good.is/articles/robots—elder-care-pepper-exoskeletons-— 
japan) or possibly as a coworker (http: //www.computerworld.com/article/ 
2990849/robotics/meet-the-virtual—woman-—who-may-take-your-—job.htm1). 
The reality is that you interact with AI and machine learning in far more mundane 
ways already. Part of the reason you need to read this chapter is to get past the 
hype and discover what AI can do for you today. 








You may also have heard machine learning and AI used interchangeably. AI 
includes machine learning, but machine learning doesn’t fully define AI. This 
chapter helps you understand the relationship between machine learning and AI 
so that you can better understand how this book helps you move into a technology 
that used to appear only within the confines of science fiction novels. 


Machine learning and AI both have strong engineering components. That is, you 
can quantify both technologies precisely based on theory (substantiated and tested 
explanations) rather than simply hypothesis (a suggested explanation for a phe- 
nomenon). In addition, both have strong science components, through which 
people test concepts and create new ideas of how expressing the thought process 
might be possible. Finally, machine learning also has an artistic component, and 
this is where a talented scientist can excel. In some cases, AI and machine learn- 
ing both seemingly defy logic, and only the true artist can make them work as 
expected. 
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YES, FULLY AUTONOMOUS WEAPONS EXIST 


Before people send us their latest dissertations about fully autonomous weapons, yes, 
some benighted souls are working on such technologies. You'll find some discussions of 
the ethics of Al in this book, but for the most part, the book focuses on positive, helpful 
uses of Al to aid humans, rather than kill them, because most Al research reflects these 
uses. You can find articles on the pros and cons of Al online, such as the Guardian arti- 
cle at http: //www. theguardian.com/technology/2@15/ jul/27/musk—wozniak— 
hawking—ban—ai-—autonomous—weapons. However, remember that these people are 
guessing — they don't actually know what the future of Al is. 


If you really must scare yourself, you can find all sorts of sites, such as http: // 

www. reachingcriticalwill.org/resources/fact-sheets/critical—issues/ 
7972-ful ly—autonomous-weapons, that discuss the issue of fully autonomous 
weapons in some depth. Sites such as Campaign to Stop Killer Robots (http : //www. 
stopkillerrobots.org/) can also fill in some details for you. We do encourage you 
to sign the letter banning autonomous weapons at http: //futureoflife.org/ 
open-letter—autonomous—weapons/ — there truly is no need for them. 


However, it's important to remember that bans against space-based, chemical, and 
certain laser weapons all exist. Countries recognize that these weapons don't solve 
anything. Countries will also likely ban fully autonomous weapons simply because the 
citizenry won't stand for killer robots. The bottom line is that the focus of this book is on 
helping you understand machine learning in a positive light. 


Dreaming of Electric Sheep 


Androids (a specialized kind of robot that looks and acts like a human, such as Data 
in Star Trek) and some types of humanoid robots (a kind of robot that has human 
characteristics but is easily distinguished from a human, such as C-3PO in Star 
Wars) have become the poster children for AI. They present computers in a form 
that people can anthropomorphize. In fact, it’s entirely possible that one day you 
won’t be able to distinguish between human and artificial life with ease. Science 
fiction authors, such as Philip K. Dick, have long predicted such an occurrence, 
and it seems all too possible today. The story “Do Androids Dream of Electric 
Sheep?” discusses the whole concept of more real than real. The idea appears as 
part of the plot in the movie Blade Runner (http: //www. warnerbros.com/blade- 
runner). The sections that follow help you understand how close technology 
currently gets to the ideals presented by science fiction authors and the movies. 


CHAPTER 1 Getting the Real Story about Al 11 


12 





Se 


TECHNICAL 
STUFF 


The current state of the art is lifelike, but you can easily tell that you’re talking to 
an android. Viewing videos online can help you understand that androids that are 
indistinguishable from humans are nowhere near any sort of reality today. Check 
out the Japanese robots at https: //www. youtube. com/watch?v=MaT fzYDZG8c and 
http: //www.nbcnews.com/tech/innovation/humanoid-robot-starts-—work 
japanese—department-store-n345526. One of the more lifelike examples is 
Amelia (https: //vimeo.com/141610747). Her story appears on ComputerWorld 
at http://www. computerwor 1d .com/article/2990849/robotics/meet—the-virtual- 
woman-who-may-take-your-—job.html. The point is, technology is just starting to 
get to the point where people may eventually be able to create lifelike robots and 
androids, but they don’t exist today. 








Understanding the history of Al and 
machine learning 


There is a reason, other than anthropomorphization, that humans see the ulti- 
mate AI as one that is contained within some type of android. Ever since the 
ancient Greeks, humans have discussed the possibility of placing a mind inside a 
mechanical body. One such myth is that of a mechanical man called Talos (http: // 
www. ancient-wisdom.com/greekautomata.htm). The fact that the ancient Greeks 
had complex mechanical devices, only one of which still exists (read about the 
Antikythera mechanism at http://www.ancient-—wisdom.com/antikythera. 
htm), makes it quite likely that their dreams were built on more than just fantasy. 
Throughout the centuries, people have discussed mechanical persons capable of 
thought (such as Rabbi Judah Loew’s Golem, http: //www.nytimes.com/2009/ 
Q5/11/world/europe/11golem.htm1). 


Al is built on the hypothesis that mechanizing thought is possible. During the first 
millennium, Greek, Indian, and Chinese philosophers all worked on ways to per- 
form this task. As early as the seventeenth century, Gottfried Leibniz, Thomas 
Hobbes, and René Descartes discussed the potential for rationalizing all thought 
as simply math symbols. Of course, the complexity of the problem eluded them 
(and still eludes us today, despite the advances you read about in Part 3 of the 
book). The point is that the vision for AI has been around for an incredibly long 
time, but the implementation of AI is relatively new. 


The true birth of AI as we know it today began with Alan Turing’s publication of 
“Computing Machinery and Intelligence” in 1950. In this paper, Turing explored 
the idea of how to determine whether machines can think. Of course, this paper 
led to the Imitation Game involving three players. Player A is a computer and 
Player B is a human. Each must convince Player C (a human who can’t see either 
Player A or Player B) that they are human. If Player C can’t determine who is 
human and who isn’t on a consistent basis, the computer wins. 
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REMEMBER 


A continuing problem with AI is too much optimism. The problem that scientists 
are trying to solve with AI is incredibly complex. However, the early optimism of 
the 1950s and 1960s led scientists to believe that the world would produce intel- 
ligent machines in as little as 20 years. After all, machines were doing all sorts of 
amazing things, such as playing complex games. AI currently has its greatest suc- 
cess in areas such as logistics, data mining, and medical diagnosis. 


Exploring what machine learning can 
do for Al 


Machine learning relies on algorithms to analyze huge datasets. Currently, 
machine learning can’t provide the sort of AI that the movies present. Even the 
best algorithms can’t think, feel, present any form of self-awareness, or exercise 
free will. What machine learning can do is perform predictive analytics far faster 
than any human can. As a result, machine learning can help humans work more 
efficiently. The current state of AI, then, is one of performing analysis, but humans 
must still consider the implications of that analysis — making the required moral 
and ethical decisions. The “Considering the Relationship between AI and Machine 
Learning” section of this chapter delves more deeply into precisely how machine 
learning contributes to AI as a whole. The essence of the matter is that 
machine learning provides just the learning part of AI, and that part is nowhere 
near ready to create an AI of the sort you see in films. 


The main point of confusion between learning and intelligence is that people 
assume that simply because a machine gets better at its job (learning) it’s also 
aware (intelligence). Nothing supports this view of machine learning. The same 
phenomenon occurs when people assume that a computer is purposely causing 
problems for them. The computer can’t assign emotions and therefore acts only 
upon the input provided and the instruction contained within an application to 
process that input. A true AI will eventually occur when computers can finally 
emulate the clever combination used by nature: 


>» Genetics: Slow learning from one generation to the next 
>> Teaching: Fast learning from organized sources 


>» Exploration: Spontaneous learning through media and interactions with others 


Considering the goals of machine learning 


At present, AI is based on machine learning, and machine learning is essentially 
different from statistics. Yes, machine learning has a statistical basis, but it makes 
some different assumptions than statistics do because the goals are different. 
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TABLE 1-1: 


Technique 


Table 1-1 lists some features to consider when comparing AI and machine learning 
to statistics. 


Comparing Machine Learning to Statistics 


Machine Learning Statistics 


Data handling Works with big data in the form of networks Models are used to create predictive power 


and graphs; raw data from sensors or the on small samples. 
web text is split into training and test data. 











Data input The data is sampled, randomized, and Parameters interpret real world phenomena 
transformed to maximize accuracy scoring and provide a stress on magnitude. 
in the prediction of out of sample (or 
completely new) examples. 

Result Probability is taken into account for The output captures the variability and 
comparing what could be the best guess uncertainty of parameters. 
or decision. 

Assumptions The scientist learns from the data. The scientist assumes a certain output and 


tries to prove it. 











Distribution The distribution is unknown or ignored The scientist assumes a well-defined 
before learning from data. distribution. 

Fitting The scientist creates a best fit, but The result is fit to the present data 
generalizable, model. distribution. 


Defining machine learning limits based 
on hardware 


Huge datasets require huge amounts of memory. Unfortunately, the requirements 
don’t end there. When you have huge amounts of data and memory, you must also 
have processors with multiple cores and high speeds. One of the problems that 
scientists are striving to solve is how to use existing hardware more efficiently. In 
some cases, waiting for days to obtain a result to a machine learning problem 
simply isn’t possible. The scientists who want to know the answer need it quickly, 
even if the result isn’t quite right. With this in mind, investments in better hard- 
ware also require investments in better science. This book considers some of the 
following issues as part of making your machine learning experience better: 


>> Obtaining a useful result: As you work through the book, you discover that 
you need to obtain a useful result first, before you can refine it. In addition, 
sometimes tuning an algorithm goes too far and the result becomes quite 
fragile (and possibly useless outside a specific dataset). 
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>» Asking the right question: Many people get frustrated in trying to obtain an 
answer from machine learning because they keep tuning their algorithm 
without asking a different question. To use hardware efficiently, sometimes 
you must step back and review the question you're asking. The question 
might be wrong, which means that even the best hardware will never find the 
answer. 


>> Relying on intuition too heavily: All machine learning questions begin as a 
hypothesis. A scientist uses intuition to create a starting point for discovering 
the answer to a question. Failure is more common than success when working 
through a machine learning experience. Your intuition adds the art to the 
machine learning experience, but sometimes intuition is wrong and you have 
to revisit your assumptions. 


When you begin to realize the importance of environment to machine learning, 

ma you can also begin to understand the need for the right hardware and in the right 

WY, balance to obtain a desired result. The current state-of-the-art systems actually 

tecnica rely on Graphical Processing Units (GPUs) to perform machine learning tasks. 

iiai Relying on GPUs does speed the machine learning process considerably. A full 

discussion of using GPUs is outside the scope of this book, but you can read more 

about the topic at http://devblogs.nvidia.com/parallelforall/bidmach- 
machine-learning-limit-gpus/. 


Overcoming Al Fantasies 


As with many other technologies, AI and machine learning both have their 
fantasy or fad uses. For example, some people are using machine learning to 
create Picasso-like art from photos. You can read all about it at https ://www. 
washingtonpost .com/news/innovations/wp/2015/08/31/this-algorithm-can-create- 
a-new-van-gogh-or-picasso-in-just-an-hour/. Of course, the problems with 
such use are many. For one thing, it’s doubtful that anyone would really want a 
Picasso created in this manner except as a fad item (because no one had done it 
before). The point of art isn’t in creating an interesting interpretation of a par- 
ticular real-world representation, but rather in seeing how the artist interpreted 
it. The end of the article points out that the computer can only copy an existing 
style at this stage — not create an entirely new style of its own. The following 
sections discuss AI and machine learning fantasies of various sorts. 
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Discovering the fad uses of Al and 
machine learning 


AI is entering an era of innovation that you used to read about only in science 
fiction. It can be hard to determine whether a particular AI use is real or simply the 
dream child of a determined scientist. For example, The Six Million Dollar Man 
(https: //en.wikipedia.org/wiki/The_Six_Million_Dollar_Man) is a televi- 
sion series that looked fanciful at one time. When it was introduced, no one actu- 
ally thought that we’d have real world bionics at some point. However, Hugh Herr 
has other ideas — bionic legs really are possible now (http: //www.smithsonianmag. 
com/innovation/ future-robotic—legs-180953040/). Of course, they aren’t 
available for everyone yet; the technology is only now becoming useful. Muddying 
the waters is another television series, The Six Billion Dollar Man (http://www. 
cinemablend.com/new/Mark-—Wah1berg-Six-Billion—Dol lar—Man-—Just-Made 
Big-Change-91947.html). The fact is that AI and machine learning will both 
present opportunities to create some amazing technologies and that we’re already 
at the stage of creating those technologies, but you still need to take what you hear 
with a huge grain of salt. 





To make the future uses of AI and machine learning match the concepts that sci- 
ence fiction has presented over the years, real-world programmers, data scien- 
tists, and other stakeholders need to create tools. Chapter 8 explores some of the 

REMEMBER New tools that you might use when working with AI and machine learning, but 
these tools are still rudimentary. Nothing happens by magic, even though it may 
look like magic when you don’t know what’s happening behind the scenes. In 
order for the fad uses for AI and machine learning to become real-world uses, 
developers, data scientists, and others need to continue building real-world tools 
that may be hard to imagine at this point. 


Considering the true uses of Al and 
machine learning 


You find AI and machine learning used in a great many applications today. The 
only problem is that the technology works so well that you don’t know that it even 
exists. In fact, you might be surprised to find that many devices in your home 
already make use of both technologies. Both technologies definitely appear in your 
car and most especially in the workplace. In fact, the uses for both AI and machine 
learning number in the millions — all safely out of sight even when they’re quite 
dramatic in nature. Here are just a few of the ways in which you might see AI used: 


>> Fraud detection: You get a call from your credit card company asking 
whether you made a particular purchase. The credit card company isn't being 
nosy; it’s simply alerting you to the fact that someone else could be making a 
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purchase using your card. The Al embedded within the credit card company’s 
code detected an unfamiliar spending pattern and alerted someone to it. 


Resource scheduling: Many organizations need to schedule the use of 
resources efficiently. For example, a hospital may have to determine where to 
put a patient based on the patient's needs, availability of skilled experts, and 
the amount of time the doctor expects the patient to be in the hospital. 


Complex analysis: Humans often need help with complex analysis because 
there are literally too many factors to consider. For example, the same set of 
symptoms could indicate more than one problem. A doctor or other expert 
might need help making a diagnosis in a timely manner to save a patient's life. 


Automation: Any form of automation can benefit from the addition of Al to 
handle unexpected changes or events. A problem with some types of automa- 
tion today is that an unexpected event, such as an object in the wrong place, 
can actually cause the automation to stop. Adding Al to the automation can 
allow the automation to handle unexpected events and continue as if nothing 
happened. 


Customer service: The customer service line you call today may not even 
have a human behind it. The automation is good enough to follow scripts and 
use various resources to handle the vast majority of your questions. With 
good voice inflection (provided by Al as well), you may not even be able to tell 
that you're talking with a computer. 


Safety systems: Many of the safety systems found in machines of various 
sorts today rely on Al to take over the vehicle in a time of crisis. For example, 
many automatic braking systems rely on Al to stop the car based on all the 
inputs that a vehicle can provide, such as the direction of a skid. 


Machine efficiency: Al can help control a machine in such a manner as to 
obtain maximum efficiency. The Al controls the use of resources so that the 
system doesn't overshoot speed or other goals. Every ounce of power is used 
precisely as needed to provide the desired services. 


This list doesn’t even begin to scratch the surface. You can find AI used in many 
other ways. However, it’s also useful to view uses of machine learning outside the 
normal realm that many consider the domain of AI. Here are a few uses for 
machine learning that you might not associate with an AI: 


» 


Access control: In many cases, access control is a yes or no proposition. An 
employee smartcard grants access to a resource much in the same way that 
people have used keys for centuries. Some locks do offer the capability to set 
times and dates that access is allowed, but the coarse-grained control doesn't 
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really answer every need. By using machine learning, you can determine 
whether an employee should gain access to a resource based on role and 
need. For example, an employee can gain access to a training room when the 
training reflects an employee role. 


>> Animal protection: The ocean might seem large enough to allow animals 
and ships to cohabitate without problem. Unfortunately, many animals get hit 
by ships each year. A machine learning algorithm could allow ships to avoid 
animals by learning the sounds and characteristics of both the animal and the 
ship. 


>> Predicting wait times: Most people don't like waiting when they have no 
idea of how long the wait will be. Machine learning allows an application to 
determine waiting times based on staffing levels, staffing load, complexity of 
the problems the staff is trying to solve, availability of resources, and so on. 


Being useful; being mundane 


Even though the movies make it sound like AI is going to make a huge splash, and 
you do sometimes see some incredible uses for AI in real life, the fact of the matter 
is that most uses for AI are mundane, even boring. For example, a recent article 
details how Verizon uses the R language (see Chapters 4 and 5 for R language details) 
to analyze security breach data (http: //www.computerworld.com/article/ 
3001832/data—-analytics/how-verizon-analyzes-—security—breach-data- 
with-r.html). Part 5 of this book provides you with real-world examples of this 
same sort of analysis. The act of performing this analysis is dull when compared 
to other sorts of AI activities, but the benefits are that Verizon saves money per- 
forming the analysis using R, and the results are better as well. 


In addition, Python developers (see Chapters 6 and 7 for Python language details) 
have a huge array of libraries available to make machine learning easy. In fact, 
Kaggle (https: //www.kaggle.com/competitions) provides competitions to allow 
Python developers and R practitioners to hone their machine learning skills in 
creating practical applications. The results of these competitions often appear 
later as part of products that people actually use. Although R still relies on strong 
support from the statistical community in academic research, the Python devel- 
opment community is particularly busy creating new libraries to make develop- 
ment of complex data science and machine learning applications easier (see 
http://www. kdnuggets . com/2015/06/top-—20-python-machine-1learning—open- 
source-projects.html for the top 20 Python libraries in use today). 


PART 1 Introducing How Machines Learn 


Considering the Relationship between 
Al and Machine Learning 


Machine learning is only part of what a system requires to become an AI. The 
machine learning portion of the picture enables an AI to perform these tasks: 


>> Adapt to new circumstances that the original developer didn't envision 
>» Detect patterns in all sorts of data sources 
>» Create new behaviors based on the recognized patterns 


>> Make decisions based on the success or failure of these behaviors 


The use of algorithms to manipulate data is the centerpiece of machine learning. 
To prove successful, a machine learning session must use an appropriate algo- 
rithm to achieve a desired result. In addition, the data must lend itself to analysis 
using the desired algorithm, or it requires a careful preparation by scientists. 


AI encompasses many other disciplines to simulate the thought process success- 
fully. In addition to machine learning, AI normally includes 


» Natural language processing: The act of allowing language input and putting 
it into a form that a computer can use. 


>> Natural language understanding: The act of deciphering the language in 
order to act upon the meaning it provides. 


>» Knowledge representation: The ability to store information in a form that 
makes fast access possible. 


>> Planning (in the form of goal seeking): The ability to use stored information 
to draw conclusions in near real time (almost at the moment it happens, but 
with a slight delay, sometimes so short that a human won't notice, but the 
computer can). 


>» Robotics: The ability to act upon requests from a user in some physical form. 


In fact, you might be surprised to find that the number of disciplines required to 
create an AI is huge. Consequently, this book exposes you to only a portion of what 
an AI contains. However, even the machine learning portion of the picture can 
become complex because understanding the world through the data inputs that a 
computer receives is a complex task. Just think about all the decisions that you 
constantly make without thinking about them. For example, just the concept of 
seeing something and knowing whether you can interact successfully with it can 
become a complex task. 
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Considering Al and Machine Learning 
Specifications 


As scientists continue to work with a technology and turn hypotheses into theories, 
the technology becomes related more to engineering (where theories are imple- 
mented) than science (where theories are created). As the rules governing a technol- 
ogy become clearer, groups of experts work together to define these rules in written 
form. The result is specifications (a group of rules that everyone agrees upon). 


Eventually, implementations of the specifications become standards that a gov- 
erning body, such as the IEEE (Institute of Electrical and Electronics Engineers) or 
a combination of the ISO/IEC (International Organization for Standardization/ 
International Electrotechnical Commission), manages. AI and machine learning 
have both been around long enough to create specifications, but you currently 
won’t find any standards for either technology. 


The basis for machine learning is math. Algorithms determine how to interpret 
big data in specific ways. The math basics for machine learning appear in Part 3 of 
the book. You discover that algorithms process input data in specific ways and 
create predictable outputs based on the data patterns. What isn’t predictable is the 
data itself. The reason you need AI and machine learning is to decipher the data in 
such a manner to be able to see the patterns in it and make sense of them. 


You see the specifications detailed in Part 4 in the form of algorithms used to per- 
form specific tasks. When you get to Part 5, you begin to see the reason that every- 
one agrees to specific sets of rules governing the use of algorithms to perform 
tasks. The point is to use an algorithm that will best suit the data you have in hand 
to achieve the specific goals you’ve created. Professionals implement algorithms 
using languages that work best for the task. Machine learning relies on Python 
and R, and to some extent Matlab, Java, Julia, and C+. (See the discussion at 
https: //www. quora.com/What-—is—the-best-1anguage-—to-—use-whi le-learning— 
machine-learning—for-—the-first—time for details.) 





Defining the Divide between Art 
and Engineering 
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The reason that AI and machine learning are both sciences and not engineering 
disciplines is that both require some level of art to achieve good results. The artis- 
tic element of machine learning takes many forms. For example, you must con- 
sider how the data is used. Some data acts as a baseline that trains an algorithm 
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to achieve specific results. The remaining data provides the output used to under- 
stand the underlying patterns. No specific rules governing the balancing of data 
exist; the scientists working with the data must discover whether a specific bal- 
ance produces optimal output. 


Cleaning the data also lends a certain amount of artistic quality to the result. The 
manner in which a scientist prepares the data for use is important. Some tasks, 
such as removing duplicate records, occur regularly. However, a scientist may also 
choose to filter the data in some ways or look at only a subset of the data. As a 
result, the cleaned dataset used by one scientist for machine learning tasks may 
not precisely match the cleaned dataset used by another. 


You can also tune the algorithms in certain ways or refine how the algorithm 
works. Again, the idea is to create output that truly exposes the desired patterns 
so that you can make sense of the data. For example, when viewing a picture, a 
robot may have to determine which elements of the picture it can interact with 
and which elements it can’t. The answer to that question is important if the robot 
must avoid some elements to keep on track or to achieve specific goals. 


When working in a machine learning environment, you also have the problem of 
input data to consider. For example, the microphone found in one smartphone 
won’t produce precisely the same input data that a microphone in another smart- 
phone will. The characteristics of the microphones differ, yet the result of inter- 
preting the vocal commands provided by the user must remain the same. Likewise, 
environmental noise changes the input quality of the vocal command, and the 
smartphone can experience certain forms of electromagnetic interference. Clearly, 
the variables that a designer faces when creating a machine learning environment 
are both large and complex. 


The art behind the engineering is an essential part of machine learning. The expe- 
rience that a scientist gains in working through data problems is essential because 
it provides the means for the scientist to add values that make the algorithm work 
better. A finely tuned algorithm can make the difference between a robot success- 
fully threading a path through obstacles and hitting every one of them. 
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IN THIS CHAPTER 


Understanding the essentials of 
big data 


Locating big data sources 


Considering how statistics and big 
data work together in machine 
learning 


Defining the role of algorithms in 
machine learning 


Determining how training works 
with algorithms in machine learning 


Learning in the 
Age of Big Data 


omputers manage data through applications that perform tasks using algo- 

rithms of various sorts. A simple definition of an algorithm is a systematic 

set of operations to perform on a given data set — essentially a procedure. 
The four basic data operations are Create, Read, Update, and Delete (CRUD). This 
set of operations may not seem complex, but performing these essential tasks is 
the basis of everything you do with a computer. As the dataset becomes larger, the 
computer can use the algorithms found in an application to perform more work. 
The use of immense datasets, known as big data, enables a computer to perform 
work based on pattern recognition in a nondeterministic manner. In short, to 
create a computer setup that can learn, you need a dataset large enough for the 
algorithms to manage in a manner that allows for pattern recognition, and this 
pattern recognition needs to use a simple subset to make predictions (statistical 
analysis) of the dataset as a whole. 
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Big data exists in many places today. Obvious sources are online databases, such 
as those created by vendors to track consumer purchases. However, you find many 
non-obvious data sources, too, and often these non-obvious sources provide the 
greatest resources for doing something interesting. Finding appropriate sources 
of big data lets you create machine learning scenarios in which a machine can 
learn in a specified manner and produce a desired result. 


Statistics, one of the methods of machine learning that you consider in this book, 
is a method of describing problems using math. By combining big data with sta- 
tistics, you can create a machine learning environment in which the machine con- 
siders the probability of any given event. However, saying that statistics is the 
only machine learning method is incorrect. This chapter also introduces you to the 
other forms of machine learning currently in place. 


Algorithms determine how a machine interprets big data. The algorithm used to 
perform machine learning affects the outcome of the learning process and, there- 
fore, the results you get. This chapter helps you understand the five main tech- 
niques for using algorithms in machine learning. 


Before an algorithm can do much in the way of machine learning, you must train 
it. The training process modifies how the algorithm views big data. The final sec- 
tion of this chapter helps you understand that training is actually using a subset of 
the data as a method for creating the patterns that the algorithm needs to recog- 
nize specific cases from the more general cases that you provide as part of the 
training. 


Defining Big Data 
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Big data is substantially different from being just a large database. Yes, big data 
implies lots of data, but it also includes the idea of complexity and depth. A big 
data source describes something in enough detail that you can begin working with 
that data to solve problems for which general programming proves inadequate. 
For example, consider Google’s self-driving cars. The car must consider not only 
the mechanics of the car’s hardware and position with space but also the effects 
of human decisions, road conditions, environmental conditions, and other vehi- 
cles on the road. The data source contains many variables — all of which affect the 
vehicle in some way. Traditional programming might be able to crunch all the 
numbers, but not in real time. You don’t want the car to crash into a wall and have 
the computer finally decide five minutes later that the car is going to crash into a 
wall. The processing must prove timely so that the car can avoid the wall. 
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JUST HOW BIG IS BIG? 


Big data can really become quite big. For example, suppose that your Google self-driving 
car has a few HD cameras and a couple hundred sensors that provide information at a 
rate of 100 times/s. What you might end up with is a raw dataset with input that exceeds 
100 Mbps. Processing that much data is incredibly hard. 


Part of the problem right now is determining how to control big data. Currently, the 
attempt is to log everything, which produces a massive, detailed dataset. However, this 
dataset isn't well formatted, again making it quite hard to use. As this book progresses, 
you discover techniques that help control both the size and the organization of big data 
so that the data becomes useful in making predictions. 


The acquisition of big data can also prove daunting. The sheer bulk of the dataset 
isn’t the only problem to consider — also essential is to consider how the dataset 
is stored and transferred so that the system can process it. In most cases, develop- 
ers try to store the dataset in memory to allow fast processing. Using a hard drive 
to store the data would prove too costly, time-wise. 


When thinking about big data, you also consider anonymity. Big data presents 
privacy concerns. However, because of the way machine learning works, knowing 
specifics about individuals isn’t particularly helpful anyway. Machine learning is 

REMEMBER All about determining patterns — analyzing training data in such a manner that 
the trained algorithm can perform tasks that the developer didn’t originally pro- 
gram it to do. Personal data has no place in such an environment. 


Finally, big data is so large that humans can’t reasonably visualize it without help. 
Part of what defined big data as big is the fact that a human can learn something 
from it, but the sheer magnitude of the dataset makes recognition of the patterns 
impossible (or would take a really long time to accomplish). Machine learning 
helps humans make sense and use of big data. 


Considering the Sources of Big Data 


Before you can use big data for a machine learning application, you need a source 
of big data. Of course, the first thing that most developers think about is the huge, 
corporate-owned database, which could contain interesting information, but it’s 
just one source. The fact of the matter is that your corporate databases might not 
even contain particularly useful data for a specific need. The following sections 
describe locations you can use to obtain additional big data. 
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Building a new data source 


To create viable sources of big data for specific needs, you might find that you 
actually need to create a new data source. Developers built existing data sources 
around the needs of the client-server architecture in many cases, and these 
sources may not work well for machine learning scenarios because they lack the 
required depth (being optimized to save space on hard drives does have disadvan- 
tages). In addition, as you become more adept in using machine learning, you find 
that you ask questions that standard corporate databases can’t answer. With this 
in mind, the following sections describe some interesting new sources for big data. 


Obtaining data from public sources 


Governments, universities, nonprofit organizations, and other entities often 
maintain publicly available databases that you can use alone or combined with 
other databases to create big data for machine learning. For example, you can 
combine several Geographic Information Systems (GIS) to help create the big data 
required to make decisions such as where to put new stores or factories. The 
machine learning algorithm can take all sorts of information into account — 
everything from the amount of taxes you have to pay to the elevation of the land 
(which can contribute to making your store easier to see). 


The best part about using public data is that it’s usually free, even for commercial 
use (or you pay a nominal fee for it). In addition, many of the organizations that 
created them maintain these sources in nearly perfect condition because the orga- 
nization has a mandate, uses the data to attract income, or uses the data inter- 
nally. When obtaining public source data, you need to consider a number of issues 
to ensure that you actually get something useful. Here are some of the criteria you 
should think about when making a decision: 


>» The cost, if any, of using the data source 
>» The formatting of the data source 


>» Access to the data source (which means having the proper infrastructure in 
place, such as an Internet connection when using Twitter data) 


>» Permission to use the data source (some data sources are copyrighted) 


>» Potential issues in cleaning the data to make it useful for machine learning 


Obtaining data from private sources 


You can obtain data from private organizations such as Amazon and Google, both 
of which maintain immense databases that contain all sorts of useful information. 
In this case, you should expect to pay for access to the data, especially when used 
in a commercial setting. You may not be allowed to download the data to your 
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personal servers, so that restriction may affect how you use the data in a machine 
learning environment. For example, some algorithms work slower with data that 
they must access in small pieces. 


The biggest advantage of using data from a private source is that you can expect 
better consistency. The data is likely cleaner than from a public source. In addi- 
tion, you usually have access to a larger database with a greater variety of data 
types. Of course, it all depends on where you get the data. 


Creating new data from existing data 


Your existing data may not work well for machine learning scenarios, but that 
doesn’t keep you from creating a new data source using the old data as a starting 
point. For example, you might find that you have a customer database that con- 
tains all the customer orders, but the data isn’t useful for machine learning 
because it lacks tags required to group the data into specific types. One of the new 
job types that you can expect to create is people who massage data to make it bet- 
ter suited for machine learning — including the addition of specific information 
types such as tags. 


Machine learning will have a significant effect on your business. The article at 
http://www. computerworld.com/article/3007053/big—data/how-machine- 
learning-will-affect-your-business.html describes some of the ways in 
which you can expect machine learning to change how you do business. One of the 
points in this article is that machine learning typically works on 80 percent of the 
data. In 20 percent of the cases, you still need humans to take over the job of 
deciding just how to react to the data and then act upon it. The point is that 
machine learning saves money by taking over repetitious tasks that humans don’t 
really want to do in the first place (making them inefficient). However, machine 
learning doesn’t get rid of the need for humans completely, and it creates the need 
for new types of jobs that are a bit more interesting than the ones that machine 
learning has taken over. Also important to consider is that you need more humans 
at the outset until the modifications they make train the algorithm to understand 
what sorts of changes to make to the data. 





Using existing data sources 


Your organization has data hidden in all sorts of places. The problem is in recog- 
nizing the data as data. For example, you may have sensors on an assembly line 
that track how products move through the assembly process and ensure that the 
assembly line remains efficient. Those same sensors can potentially feed infor- 
mation into a machine learning scenario because they could provide inputs on 
how product movement affects customer satisfaction or the price you pay for 
postage. The idea is to discover how to create mashups that present existing data 
as a new kind of data that lets you do more to make your organization work well. 
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Big data can come from any source, even your email. A recent article discusses 
how Google uses your email to create a list of potential responses for new emails. 
(See the article athttp: //www.semrush.com/blog/deep—learning-—an-upcoming— 
gmai1-feature-that-wi1l1l-—answer-your-emails-for-you/.) Instead of having 
to respond to every email individually, you can simply select a canned response at 
the bottom of the page. This sort of automation isn’t possible without the original 
email data source. Looking for big data in specific locations will blind you to the 
big data sitting in common places that most people don’t think about as data 
sources. Tomorrow’s applications will rely on these alternative data sources, but 
to create these applications, you must begin seeing the data hidden in plain view 
today. 





Some of these applications already exist, and you’re completely unaware of them. 
The video at http: //research.microsoft.com/apps/video/default.aspx?id= 
256288 makes the presence of these kinds of applications more apparent. By the 
time you complete the video, you begin to understand that many uses of machine 
learning are already in place and users already take them for granted (or have no 
idea that the application is even present). 


Locating test data sources 


As you progress through the book, you discover the need to teach whichever algo- 
rithm you’re using (don’t worry about specific algorithms; you see a number of 
them later in the book) how to recognize various kinds of data and then to do 
something interesting with it. This training process ensures that the algorithm 
reacts correctly to the data it receives after the training is over. Of course, you also 
need to test the algorithm to determine whether the training is a success. In many 
cases, the book helps you discover ways to break a data source into training and 
testing data components in order to achieve the desired result. Then, after train- 
ing and testing, the algorithm can work with new data in real time to perform the 
tasks that you verified it can perform. 


In some cases, you might not have enough data at the outset for both training (the 
essential initial test) and testing. When this happens, you might need to create a 
test setup to generate more data, rely on data generated in real time, or create the 
test data source artificially. You can also use similar data from existing sources, 
such as a public or private database. The point is that you need both training and 
testing data that will produce a known result before you unleash your algorithm 
into the real world of working with uncertain data. 
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Specifying the Role of Statistics 
in Machine Learning 


Some sites online would have you believe that statistics and machine learning are 
two completely different technologies. For example, when you read Statistics vs. 
Machine Learning, fight! (http: //brenocon. com/blog/20@8/12/statistics—vs— 
machine-learning-fight/), you get the idea that the two technologies are not 
only different, but downright hostile toward each other. The fact is that statistics 
and machine learning have a lot in common and that statistics represents one of 
the five tribes (schools of thought) that make machine learning feasible. The five 
tribes are 


>» Symbolists: The origin of this tribe is in logic and philosophy. This group relies 
on inverse deduction to solve problems. 


>> Connectionists: The origin of this tribe is in neuroscience. This group relies 
on backpropagation to solve problems. 


>> Evolutionaries: The origin of this tribe is in evolutionary biology. This group 
relies on genetic programming to solve problems. 


>> Bayesians: This origin of this tribe is in statistics. This group relies on probabi- 
listic inference to solve problems. 


>> Analogizers: The origin of this tribe is in psychology. This group relies on 
kernel machines to solve problems. 


The ultimate goal of machine learning is to combine the technologies and strate- 
gies embraced by the five tribes to create a single algorithm (the master algorithm) 
that can learn anything. Of course, achieving that goal is a long way off. Even so, 
scientists such as Pedro Domingos (http: //homes.cs.washington.edu/~pedrod/) 
are currently working toward that goal. 


This book follows the Bayesian tribe strategy, for the most part, in that you solve 
most problems using some form of statistical analysis. You do see strategies 
embraced by other tribes described, but the main reason you begin with statistics 
is that the technology is already well established and understood. In fact, many 
elements of statistics qualify more as engineering (in which theories are imple- 
mented) than science (in which theories are created). The next section of the chap- 
ter delves deeper into the five tribes by viewing the kinds of algorithms each tribe 
uses. Understanding the role of algorithms in machine learning is essential to 
defining how machine learning works. 
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Understanding the Role of Algorithms 
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Everything in machine learning revolves around algorithms. An algorithm is a pro- 
cedure or formula used to solve a problem. The problem domain affects the kind of 
algorithm needed, but the basic premise is always the same — to solve some sort 
of problem, such as driving a car or playing dominoes. In the first case, the prob- 
lems are complex and many, but the ultimate problem is one of getting a passenger 
from one place to another without crashing the car. Likewise, the goal of playing 
dominoes is to win. The following sections discuss algorithms in more detail. 


Defining what algorithms do 


An algorithm is a kind of container. It provides a box for storing a method to solve 
a particular kind of a problem. Algorithms process data through a series of well- 
defined states. The states need not be deterministic, but the states are defined 
nonetheless. The goal is to create an output that solves a problem. In some cases, 
the algorithm receives inputs that help define the output, but the focus is always 
on the output. 


Algorithms must express the transitions between states using a well-defined and 
formal language that the computer can understand. In processing the data and solv- 
ing the problem, the algorithm defines, refines, and executes a function. The func- 
tion is always specific to the kind of problem being addressed by the algorithm. 


Considering the five main techniques 


As described in the previous section, each of the five tribes has a different tech- 
nique and strategy for solving problems that result in unique algorithms. Com- 
bining these algorithms should lead eventually to the master algorithm that will 
be able to solve any given problem. The following sections provide an overview of 
the five main algorithmic techniques. 


Symbolic reasoning 


The term inverse deduction commonly appears as induction. In symbolic reason- 
ing, deduction expands the realm of human knowledge, while induction raises the 
level of human knowledge. Induction commonly opens new fields of exploration, 
while deduction explores those fields. However, the most important consideration 
is that induction is the science portion of this type of reasoning, while deduction 
is the engineering. The two strategies work hand in hand to solve problems by 
first opening a field of potential exploration to solve the problem and then explor- 
ing that field to determine whether it does, in fact, solve it. 


PART 1 Introducing How Machines Learn 


As an example of this strategy, deduction would say that if a tree is green and that 
green trees are alive, the tree must be alive. When thinking about induction, you 
would say that the tree is green and that the tree is also alive; therefore, green 
trees are alive. Induction provides the answer to what knowledge is missing given 
a known input and output. 


Connections modelled on the brain’s neurons 


The connectionists are perhaps the most famous of the five tribes. This tribe 
strives to reproduce the brain’s functions using silicon instead of neurons. Essen- 
tially, each of the neurons (created as an algorithm that models the real-world 
counterpart) solves a small piece of the problem, and the use of many neurons in 
parallel solves the problem as a whole. 


The use of backpropagation, or backward propagation of errors, seeks to determine 
the conditions under which errors are removed from networks built to resemble the 
human neurons by changing the weights (how much a particular input figures into 
the result) and biases (which features are selected) of the network. The goal is to 
continue changing the weights and biases until such time as the actual output 
matches the target output. At this point, the artificial neuron fires and passes its 
solution along to the next neuron in line. The solution created by just one neuron 
is only part of the whole solution. Each neuron passes information to the next neu- 
ron in line until the group of neurons creates a final output. 


Evolutionary algorithms that test variation 


The evolutionaries rely on the principles of evolution to solve problems. In other 
words, this strategy is based on the survival of the fittest (removing any solutions 
that don’t match the desired output). A fitness function determines the viability of 
each function in solving a problem. 


Using a tree structure, the solution method looks for the best solution based on 
function output. The winner of each level of evolution gets to build the next-level 
functions. The idea is that the next level will get closer to solving the problem but 
may not solve it completely, which means that another level is needed. This par- 
ticular tribe relies heavily on recursion and languages that strongly support recur- 
sion to solve problems. An interesting output of this strategy has been algorithms 
that evolve: One generation of algorithms actually builds the next generation. 


Bayesian inference 


The Bayesians use various statistical methods to solve problems. Given that sta- 
tistical methods can create more than one apparently correct solution, the choice 
of a function becomes one of determining which function has the highest proba- 
bility of succeeding. For example, when using these techniques, you can accept a 
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set of symptoms as input and decide the probability that a particular disease will 
result from the symptoms as output. Given that multiple diseases have the same 
symptoms, the probability is important because a user will see some in which a 
lower probability output is actually the correct output for a given circumstance. 


Ultimately, this tribe supports the idea of never quite trusting any hypothesis (a 
result that someone has given you) completely without seeing the evidence used 
to make it (the input the other person used to make the hypothesis). Analyzing the 
evidence proves or disproves the hypothesis that it supports. Consequently, it 
isn’t possible to determine which disease someone has until you test all the symp- 
toms. One of the most recognizable outputs from this tribe is the spam filter. 


Systems that learn by analogy 


The analogyzers use kernel machines to recognize patterns in data. By recogniz- 
ing the pattern of one set of inputs and comparing it to the pattern of a known 
output, you can create a problem solution. The goal is to use similarity to deter- 
mine the best solution to a problem. It’s the kind of reasoning that determines 
that using a particular solution worked in a given circumstance at some previous 
time; therefore, using that solution for a similar set of circumstances should also 
work. One of the most recognizable outputs from this tribe is recommender sys- 
tems. For example, when you get on Amazon and buy a product, the recommender 
system comes up with other, related products that you might also want to buy. 


Defining What Training Means 
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Many people are somewhat used to the idea that applications start with a func- 
tion, accept data as input, and then provide a result. For example, a programmer 
might create a function called Add() that accepts two values as input, such as 1 
and 2. The result of Add( ) is 3. The output of this process is a value. In the past, 
writing a program meant understanding the function used to manipulate data to 
create a given result with certain inputs. 


Machine learning turns this process around. In this case, you know that you have 
inputs, such as 1 and 2. You also know that the desired result is 3. However, you 
don’t know what function to apply to create the desired result. Training provides 
a learner algorithm with all sorts of examples of the desired inputs and results 
expected from those inputs. The learner then uses this input to create a function. 
In other words, training is the process whereby the learner algorithm maps a flex- 
ible function to the data. The output is typically the probability of a certain class 
or a numeric value. 
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A single learner algorithm can learn many different things, but not every algo- 
rithm is suited for certain tasks. Some algorithms are general enough that they 
can play chess, recognize faces on Facebook, and diagnose cancer in patients. An 
algorithm reduces the data inputs and the expected results of those inputs to a 
function in every case, but the function is specific to the kind of task you want the 
algorithm to perform. 


The secret to machine learning is generalization. The goal is to generalize the 
output function so that it works on data beyond the training set. For example, 
consider a spam filter. Your dictionary contains 100,000 words (actually a small 
dictionary). A limited training dataset of 4,000 or 5,000 word combinations must 
create a generalized function that can then find spam in the 24100,000 combina- 
tions that the function will see when working with actual data. 


When viewed from this perspective, training might seem impossible and learning 
even worse. However, to create this generalized function, the learner algorithm 
relies on just three components: 


>> Representation: The learner algorithm creates a model, which is a function 
that will produce a given result for specific inputs. The representation is a set 
of models that a learner algorithm can learn. In other words, the learner 
algorithm must create a model that will produce the desired results from the 
input data. If the learner algorithm can't perform this task, it can’t learn from 
the data and the data is outside the hypothesis space of the learner algorithm. 
Part of the representation is to discover which features (data elements within 
the data source) to use for the learning process. 


>> Evaluation: The learner can create more than one model. However, it doesn’t 
know the difference between good and bad models. An evaluation function 
determines which of the models works best in creating a desired result from a 
set of inputs. The evaluation function scores the models because more than 
one model could provide the required results. 


>> Optimization: At some point, the training process produces a set of models 
that can generally output the right result for a given set of inputs. At this point, 
the training process searches through these models to determine which one 
works best. The best model is then output as the result of the training process. 


Much of this book focuses on representation. For example, in Chapter 14 you dis- 
cover how to work with the K-Nearest Neighbor (KNN) algorithm. However, the 
training process is more involved than simply choosing a representation. All three 
steps come into play when performing the training process. Fortunately, you can 
start by focusing on representation and allow the various libraries discussed in the 
book to do the rest of the work for you. 
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IN THIS CHAPTER 


Discovering how machine learning 
will help create useful future 
technologies 


Developing new kinds of work as the 
result of machine learning 


Considering how machine learning 
can create potential problems 


Having a Glance at 
the Future 


achine learning technology appears in many products today, but it isn’t 

even close to complete usability yet. The algorithms used for machine 

learning today are still relatively basic when compared to what scientists 
plan to provide for the future. In addition, the data sources for machine learning 
today are smaller than the datasets planned for future use. In short, machine 
learning is in its infancy. It already performs a considerable number of tasks amaz- 
ingly well, however. This chapter looks at what might be possible in the future. It 
helps you understand the direction that machine learning is taking and how that 
direction could help entrench machine learning into every aspect of daily life. 


One of the issues that comes with a new technology such as machine learning is a 
fear that machine learning will keep people from working. Quite the contrary: 
Machine learning will open new occupations that people should find more excit- 
ing than working on an assembly line or flipping burgers at a restaurant. One of 
the goals is to provide creative and interesting work for people to do. Of course, 
these new jobs will require more and new kinds of training before people can per- 
form them well. 


Every new technology also comes with pitfalls. It’s a cliché but true that it’s easier 
to destroy than to create. The potential pitfalls of machine learning need to be 
taken seriously. Because this technology is in its infancy, now is the time to con- 
sider the potential pitfalls and do something about them before they materialize. 
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The last section of the chapter doesn’t discuss problems that will happen; rather, 
it discusses problems that could happen and that people can certainly avoid by 
using the technology correctly. 


Creating Useful Technologies 
for the Future 


36 


To survive, a technology must prove useful. In fact, it must prove more than use- 
ful; it must meet perceived needs in a manner that existing technologies don’t as 
well as build a base of adherents who provide a monetary reason to continue 
investing in the technology. For example, the Apple Lisa was an interesting and 
useful piece of technology that demonstrated the usefulness of the GUI to busi- 
ness users who had never seen one before. It solved the need to make computers 
friendly. However, it failed because it didn’t build a base of adherents. The com- 
puter simply failed to live up to the hype surrounding it (see http://www. 
fastcompany .com/3041272/the-recommender/remember-—apples—lisa—the 








computer-that-cost-steve-jobs—a-gig-kevin-costner-d for details). The 
next system that Apple built, the Macintosh, did live up to the hype a bit better — 
yet it built on the same technology that the Lisa used. The difference is that the 
Macintosh developed a considerable array of hard-core adherents. 


Machine learning solves a considerable number of problems in a manner that 
other technologies would have a hard time copying. However, to become the 
must-have technology that everyone wants to invest in, machine learning must 
build that cadre of hard-core adherents. Machine learning already has some 
adherents, and you might already be one of them, but the technology must move 
into mainstream computing to make it something that everyone must have. The 
following sections discuss some of the ways in which machine learning is already 
affecting you personally and how this use will increase in the future — making 
machine learning a must-have technology. 


Considering the role of machine 
learning in robots 


A goal of machine learning today is to create useful, in-home robots. Now, you might 
be thinking of something along the lines of Rosie the robot found in The Jetsons (see 
http: //thejetsons.wikia.com/wiki/Rosie for a quick description). However, 
real-world robots need to solve practical and important problems to attract atten- 
tion. To become viable and attract funding, a technology must also amass a group of 
followers, and to do that, it must provide both interaction and ownership. 
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An example of a successful in-home robot is the Roomba from iRobot (http: // 
www. amazon. com/exec/obidos/ASIN/B@Q5GK3I1VW/datacservip@f-2@/). You can 
actually buy a Roomba today; it serves a useful purpose; and it has attracted 
enough attention to make it a viable technology. The Roomba also shows what is 
doable at a commercial, in-home, and autonomous level today. Yes, the Roomba 
is a fancy vacuum cleaner — one with built-in smarts based on simple but very 
effective algorithms. The Roomba can successfully navigate a home, which is a lot 
harder than you might think to accomplish. It can also spend more time on dirtier 
areas of the home. However, you still need to empty the Roomba when full; cur- 
rent robot technology does only so much. 


You can find other real-world robots that people are using to perform specialized 
tasks, but you won’t find them in your home. The article at http: //mashable. 
com/2010/10/10/10-amazing-real-life-robots/ talks about ten such robots. 
In each case, the robot has a specialized purpose and acts in a limited number of 
ways. Other sites present other robots, but you won’t find general-purpose uses 
in any of them. Before robots can enter a home and work as a generalized helper, 
machine learning needs to solve a wealth of problems, and the algorithms need to 
become both more generalized and deeper thinking. By now, you should see that 
robots will become a part of daily life, but it won’t happen right away (or perhaps 
even in the near future). 


Using machine learning in health care 


An issue that is receiving a lot of attention is the matter of elder care. People are 
living longer, and a nursing home doesn’t seem like a good way to spend one’s twi- 
light years. Robots will make it possible for people to remain at home yet also remain 
safe. Some countries are also facing a critical shortage of health care workers, and 
Japan is one. As a result, the country is spending considerable resources to solve the 
problems that robotics present. (Read the story at http: //magazine.good.is/ 
articles/robots—elder-—care—pepper—exoskeletons-japan for details.) 





The closest that technology currently comes to the vision presented by an in- 
home nurse robot is the telepresence robot (see http: //www.huffingtonpost. 
com/2013/11/17/robots—let-—doctors—beam_n_4291632.htm1 for details). In this 
case, the robot is an extension of a human doctor, so it’s not even close to what the 
Japanese hope to create in the near future. Like the Roomba, this robot can success- 
fully navigate a home. It also allows the doctor to see and hear the patient. The robot 
partially solves the problem of too many patients in too large a geographical area and 
not enough doctors, but it’s still a long way off from an autonomous solution. 


Creating smart systems for various needs 


Many of the solutions you can expect to see that employ machine learning will be 
assistants to humans. They perform various tasks extremely well, but these tasks 
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are mundane and repetitive in nature. For example, you might need to find a res- 
taurant to satisfy the needs of an out-of-town guest. You can waste time looking for 
an appropriate restaurant yourself, or you can access an AI to do it in far less time, 
with greater accuracy and efficiency. Chapter 1 discusses such solutions in the form 
of Siri. Another such solution is Nara (http: //www.news.com.au/technology/ 
innovation/meet-your-arti ficial-brain-the-algorithm-redefining-the- 
web/news-story/6a9eb73d f016254a65d96426e7dd59b4), an experimental AI that 
learns your particular likes and dislikes as you spend more time with it. Unlike 
Siri, which can answer basic questions, Nara goes a step further and makes rec- 
ommendations. Chapter 21 spends a lot more time discussing this particular need. 


Using machine learning in industrial 
settings 

Machine learning is already playing an important part in industrial settings where 
the focus is on efficiency. Doing things faster, more accurately, and with fewer 
resources helps the bottom line and makes an organization more flexible with a 
higher profit margin. Fewer mistakes also help the humans working in an organi- 


zation by reducing the frustration level. You can currently see machine learning at 
work in 


>> Medical diagnosis 

» Data mining 

>> Bioinformatics 

>> Speech and handwriting recognition 

>> Product categorization 

>> Inertial Measurement Unit (IMU) (such as motion capture technology) 


>> Information retrieval 


This list just scratches the surface. Machine learning is used a lot in industry 
today, and the number of uses will continue to increase as advanced algorithms 
make higher levels of learning possible. Currently, machine learning performs 
tasks in a number of areas that include the following: 


>» Analyzation: Determining what a user wants and why, and what sort of 
patterns (behaviors, associations, responses, and so on) the user exhibits 
when obtaining it. 


>> Enrichment: Adding ads, widgets, and other features to an environment so 
that the user and organization can obtain additional benefits, such as 
increased productivity or improved sales. 
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>» Adaptation: Modifying a presentation so that it reflects user tastes and 
choice of enrichment. Each user ends up with a customized experience that 
reduces frustration and improves productivity. 


>> Optimization: Modifying the environment so that the presentation consumes 
fewer resources without diminishing the user experience. 


>> Control: Steering the user to a particular course of action based on inputs 
and the highest probability of success. 


A theoretical view of what machine learning does in industry is nice, but it’s 
important to see how some of this works in the real world. You can see machine 
learning used in relatively mundane but important ways. For example, machine 
learning has a role in automating employee access, protecting animals, predicting 
emergency room wait times, identifying heart failure, predicting strokes and 
heart attacks, and predicting hospital readmissions. (The story at http://www. 
forbes.com/sites/85broads/2014/01/06/six—novel-—machine-learning- 
applications/ provides details on each of these uses.) 


Understanding the role of updated 
processors and other hardware 


The “Specifying the Role of Statistics in Machine Learning” section of Chapter 2 
tells you about the five schools of thought (tribes) related to machine learning. Each 
of these schools of thought tell you that the current computer hardware isn’t quite 
up to the task of making machine learning work properly. For example, you might 
talk to one tribe whose members tell you of the need for larger amounts of system 
memory and the use of GPUs to provide faster computations. Another tribe might 
espouse the creation of new types of processors. Learning processors, those that 
mimic the human brain, are all the rage for the connectionists. You can read about 
one such processor at http: //www. pddnet . com/news/2015/09/neuromemristive— 
processor—breaks-—boundar ies-machine-learning. The point is that everyone 
agrees that some sort of new hardware will make machine learning easier, but the 
precise form this hardware will take remains to be seen. 


Discovering the New Work Opportunities 
with Machine Learning 


You can find more than a few articles that discuss the loss of jobs that machine 
learning and its associated technologies will cause. Robots already perform a 
number of tasks that used to employ humans, and this usage will increase 
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over time. The previous section of this chapter aided you in understanding some 
of the practical, real-world uses for machine learning today and helped you dis- 
cover where those uses are likely to expand in the future. While reading this sec- 
tion, you must have also considered how those new uses could potentially cost you 
or a loved one a job. Some authors have gone so far as to say that the future might 
hold a scenario in which learning new skills might not guarantee a job (see 
http://www. computerwor 1d.com/article/3007396/it-careers/imagining—a-future— 
where-new-skills—are-no-help.html1 for details). 





The fact of the matter is that deciding just how machine learning will affect the 
work environment is hard, just as it was hard for people to see where the indus- 
trial revolution would take us in the way of mass-producing goods for the general 
consumer (see http://www.history.com/topics/industrial-revolution for 
details). Just as those workers needed to find new jobs, so people facing loss of 
occupation to machine learning today will need to find new jobs. 


Working for a machine 


It’s entirely possible that you’ll find yourself working for a machine in the future. 
In fact, you might already work for a machine and not know it. Some companies 
already use machine learning to analyze business processes and make them more 
efficient. For example, Hitachi currently uses such a setup in middle management 
(see the article at http: //www. hitachi .com/New/cnews/month/2015/09/150904. 
html). In this case, the AI actually issues the work orders based on its analysis of 
the workflow — just as a human middle manager might do. The difference is that 
the AI is actually eight percent more efficient than the humans it replaces. In 
another case, Amazon ran a contest among machine learning experts to figure out 
whether the company could better process employee authorization processes 
automatically using machine learning (http: //kaggle.com/c/amazon-emp loyee- 
access-challenge). Again, the point was to figure out how to replace middle 
management and cut a bit of the red tape. 


However, a job opportunity also presents itself. Workers under the AI do perform 
the tasks that the AI tells them to do, but they can use their own experience and 
creativity in determining how to perform the task. The AI analyzes the processes 
that the human workers use and measures the results achieved. Any successful 
processes get added into the database of techniques that workers can apply to 
accomplish tasks. In other words, the humans are teaching the AI new techniques 
to make the work environment even more efficient. 


This is an example of how machine learning can free humans from the drudgery of 
the work environment. When using human middle managers, new processes often 
get buried in a bureaucracy of unspoken rules and ego. The AI middle manager is 
designed to learn new techniques without bias, so the humans are encouraged to 
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exercise their creativity, and everyone benefits. In short, the AI, which lacks an ego 
to bruise, is the approachable manager that many workers have wanted all along. 


Working with machines 


People already work with machines on a regular basis — they may just not realize 
it. For example, when you talk to your smartphone and it recognizes what you say, 
you’re working with a machine to achieve a desired goal. Most people recognize 
that the voice interaction provided with a smartphone improves with time — the 
more you use it, the better it gets at recognizing your voice. As the learner algo- 
rithm becomes better tuned, it becomes more efficient at recognizing your voice 
and obtaining the desired result. This trend will continue. 


However, machine learning is used in all sorts of ways that might not occur to 
you. When you point a camera at a subject and the camera can put a box around 
the face (to help target the picture), you’re seeing the result of machine learning. 
The camera is helping you perform the job of taking a picture with far greater 
efficiency. In addition, the camera automatically removes at least some of the 
effects of shaking and bad lighting. Cameras have become quite good at assisting 
humans to perform tasks with aplomb. 


The use of declarative languages, such as SQL (Structured Query Language), will 
become more pronounced as well because machine learning will make advance- 
ments possible. In some respects, a declarative language simply lets you describe 
what you want and not how to obtain it. However, SQL still requires a computer 
scientist, data scientist, database administrator, or some other professional to use. 
Future languages won’t have this limitation. Eventually, someone who is trained 
to perform a particular task well will simply tell the robot assistant what to do and 
the robot assistant will discover the means to do it. Humans will use creativity to 
discover what to do; the details (the how) will become the domain of machines. 


Repairing machines 


Most of this chapter discusses current technology, where the technology will go in 
the future, and why things work as they do. However, notice that the discussion 
always focuses on the technology doing something. That’s right, before the technol- 
ogy can do anything else, it must perform a practical task that will attract attention 
and benefit humans in a manner that makes people want to have the technology for 
their own. It doesn’t matter what the technology is. Eventually, the technology will 
break. Getting the technology to do something useful is the prime consideration now, 
and the culmination of any dreams of what the technology will eventually do stretches 
years into the future, so mundane things like repairing the technology will still fall on 
human shoulders. Even if the human isn’t directly involved with the physical repair, 
human intelligence will direct the repair operation. 
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TIP 


Some articles that you read online might make you believe that self-repairing 
robots are already a reality. For example, the International Space Station robots, 
Dextre and Canadarm, performed a repair of a faulty camera (see the story at 
http://space.io9.com/a-self-repairing-space-robot-on-the-international- 
space-1580869685). What the stories don’t say is that a human decided how to 
perform the task and directed the robots to do the physical labor. Autonomous 
repair isn’t possible with the algorithms available today. Repairs often bedevil 
human experts, and until robots replicate the knowledge and skills of these 
experts, the same repairs will remain impossible for them. 





Creating new machine learning tasks 


Machine learning algorithms aren’t creative, which means that humans must 
provide the creativity that improves machine learning. Even algorithms that build 
other algorithms only improve the efficiency and accuracy of the results that the 
algorithm achieves — they can’t create algorithms that perform new kinds of 
tasks. Humans must provide the necessary input to define these tasks and the 
processes needed to begin solving them. 


You may think that only experts in machine learning will create new machine 
learning tasks. However, the story about the middle manager from Hitachi dis- 
cussed in the “Working for a machine” section, earlier in this chapter, should tell 
you that things will work differently than that. Yes, experts will help form the 
basis for defining how to solve the task, but the actual creation of tasks will come 
from people who know a particular industry best. The Hitachi story serves as a 
basis for understanding both that the future will see people from all walks of life 
contributing toward machine learning scenarios and that a specific education 
might not even help in defining new tasks. 


Devising new machine learning 
environments 


At the moment, devising new machine learning environments is the realm of 
research and development companies. A group of highly trained specialists must 
create the parameters for a new environment. For example, NASA needs robots 
to explore Mars. In this case, NASA relyies on the skills of people at MIT and 
Northeastern to perform the task (see the story at http://www. computerworld. 
com/article/3007393/robotics/nasa—needs-robotic—upgrades- for-—work— 
on-mars .htm1). Given that the robot will need to perform tasks autonomously, the 
machine learning algorithms will become quite complex and include several levels 
of problem solving. 
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Eventually, someone will be able to describe a problem in sufficient detail that a 
specialized program can create the necessary algorithm using an appropriate 
language. In other words, average people will eventually begin creating new 
machine learning environments based on ideas they have and want to try. As with 
creating machine learning tasks, people who create future environments will be 
experts in their particular craft, rather than be computer scientists or data scien- 
tists. Solving the science of machine learning will eventually turn into an engi- 
neering exercise that will give anyone with a good idea the required access. 


Avoiding the Potential Pitfalls of 
Future Technologies 


Any new technology comes with potential pitfalls. The higher the expectations for 
that technology, the more severe the pitfalls become. Unrealistic expectations 
cause all sorts of problems with machine learning because people think that what 
they see in movies is what they’ll get in the real world. It’s essential to remember 
the basic concepts presented in Chapter 1 — that machine learning algorithms 
currently can’t feel, think independently, or create anything. Unlike those movie 
Als, a machine learning algorithm does precisely what you expect it to do, nothing 
more. Of course, some of the results are amazing, but keeping expectations in line 
with what the actual technology can do is important. Otherwise, you’ll promise 
something that the technology can never deliver, and those adherents whom you 
were expecting will go looking for the next big thing. 


In fact, the uses for machine learning today are quite narrow. As described in 
the article at https: //www. linkedin. com/pulse/machine-learning-its-hard-— 
problems—valuable-toby—coppel, narrow AI, such as the business use of AI to 
gain insights into huge datasets, relies on well-understood techniques that com- 
panies have started to employ recently. The machine can’t infer anything, which 
limits the use of the machine to the task for which the developer or data scientist 
designed it. In fact, a good analogy for today’s algorithms is that they’re like a 
tailored shirt (see the article at http://www.computerworld.com/article/ 
3006525/cloud-computing/why-microsofts-—data-chie f-thinks—machine- 
learning—tools—are-1ike-tailored-shirts.html for more details). You need 
specialized skills to create an algorithm that is tailored to meet specific needs 
today, but the future could see algorithms that can tackle nearly any task. Compa- 
nies that rely on narrow AI need to exercise care in how they develop products or 
services. A change in product or service offerings might place the data used for the 
machine learning environment outside the learner algorithm’s domain, reducing 
the output of the machine learning algorithm to gibberish (or at least making it 
unreliable). 








CHAPTER 3 Having a Glance at the Future 43 


44 


ON THE WEB 


STARTS AND STOPS IN Al USAGE 


Al has had quite a few stops and starts. During its less intense periods, mainly from 1974 
to 1980 and again from 1987 to 1993 (plus several smaller episodes), Al experienced 
what has been called an A/ winter, when investors have withdrawn funds from previous 
investments in Al and machine learning because of disappointments in the practical 
application of the technology. Part of the problem was the hype created by the press. 


For instance, in 1966, investors expressed disappointment over the failure of machine 
translation. Originally, everyone expected translation to work well within in a few years. 
Translation is now a reality thanks to Google and Microsoft after about 50 years. DARPA 
and private companies reduced investment in the 1970s and 1980s because the algo- 
rithms of the time didn’t match the high expectations that mainly nonexperts formulated. 


Thanks to the possible advancements proposed by key publications of eminent scholars 
and experts like Prof. Geoffrey Hinton, Al winter ended every time, creating an opening 
to a new spring in research and development of Al technologies. However, new winters 
may be ahead of us. Though the hype cycles of a technology may be great because they 
provide needed funds, practical and fact-based evaluation should conduct expectations 
for new complex and revolutionary technologies such as Al and machine learning. 


Deep learning, a technique that allows the machine to choose inputs by using mul- 
tiple processing layers to analyze complex structures, is the next step in the pro- 
cess of employing AI in practical ways, but few companies currently employ it and 
it requires a huge amount of computing power to perform practical tasks. The 
Deep Learning site (http: //deeplearning.net/) provides you with resources for 
working with deep learning in case you want to do so. This site offers articles, 
tutorials, example code, and other information that you need to understand deep 
learning better. Although you might think that everyone uses deep learning today, 
it’s actually a future technology that a few companies choose to embrace even 
though it isn’t perfected (or anywhere near perfected) at this point. 


Using machine learning in an organization also requires that you hire people with 
the right set of skills and create a team. Machine learning in the corporate envi- 
ronment, where results mean an improvement in the bottom line, is relatively 
new. Companies face challenges in getting the right team together, developing a 
reasonable set of goals, and then actually accomplishing those goals. To attract a 
world-class team, your company has to offer a problem that’s exciting enough to 
entice the people needed from other organizations. It isn’t an easy task, and you 
need to think about it as part of defining the goals for creating a machine learning 
environment. 
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Installing an R 
Distribution 
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REMEMBER 


his book relies on two languages to demonstrate machine learning. Of 

course, many other languages are available to use, but these two languages, 

R and Python, provide the best resources and capabilities for studying and 
working with machine learning examples. In addition, industry uses these two 
languages the most as well. The first section of this chapter discusses some of the 
R distributions you might consider using to meet specific machine learning needs. 
The chapter also helps you understand why the book uses the particular R distri- 
bution that it does. 


In addition, you find out how to install R on your system in this chapter (assuming 
that you use Linux, Mac OS X, or Windows as a platform). 


The next chapter gives you some basic information on how R works to make 
following the examples easier. This book can’t provide you with a full-fledged R 
tutorial (another whole book would be needed to perform that task), so the next 
chapter is more of an overview designed to help you get started quickly. 


Chapters 6 and 7 help you install and obtain an overview of Python. You need both 
languages to complete all the examples in the book because the examples rely on 
the best language for the task. In other words, by following the examples in this 
book, you not only discover how to perform machine learning tasks but also gain 
insights into which language works best for a particular task. 
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Even if you choose not to install R using the procedures in this chapter, you do 
need to follow the procedures found in the last section of the chapter. These pro- 
cedures help you to install the example source code and datasets used in this book. 
If you don’t install these resources, you’ ll find it much more difficult to follow the 
text. Using the downloadable source code is a great idea because it helps you focus 
on discovering how machine learning works, rather than waste time writing 
error-free code. You can always recreate the examples manually later to practice 
your new-found skills, but using the downloadable source code (found at http: // 
www. dummies.com/extras/machinelearning) during the first pass will greatly 
enhance the learning experience. 


Choosing an R Distribution with Machine 
Learning in Mind 
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WARNING 


R is a combination of an environment and a language. It’s a form of the S pro- 
gramming language, which John Chambers originally created at Bell Laboratories 
to make working with statistics easier. Rick Becker and Allan Wilks eventually 
added to the S programming language as well. The goal of the R language is to 
turn ideas into software quickly and easily. In other words, R is a language 
designed to help someone who doesn’t have much programming experience create 
code without a huge learning curve. This book uses R instead of S because R is a 
freely downloadable product that can run most S code without modification; in 
contrast, you have to pay for S. Given the examples used in the book, R is a great 
choice. You can read more about R in general at https: //www.r-project.org/ 
about .html. 


You don’t want to make sweeping generalizations about the languages used for 
machine learning. Both R and Python are popular languages for different reasons. 
Articles such as “In data science, the R language is swallowing Python” (http: // 
www. infoworld.com/article/2951 779/app1ication-development/in-data- 
science-the-r-language-is-swallowing-python.htm1) initially seem to say 
that R is becoming more popular for some reason. The author wisely backs away 
from this statement by pointing out that R is best used for statistical purposes and 
Python is a better general-purpose language. The best developers always have an 
assortment of programming tools in their tool belts to make performing tasks 
easier. Languages address developer needs, so you need to use the right language 
for the job. After all, all languages ultimately become machine code that a proces- 
sor understands — a language that few developers understand any longer because 
high-level programming languages make development easier. 


PART 2 Preparing Your Learning Tools 


You can get a basic copy of R from the Comprehensive R Archive Network (CRAN) 
site at https: //cran.r-project.org/. The site provides both source code ver- 
sions and compiled versions of the R distribution for various platforms. Unless 
ontHewes you plan to make your own changes to the basic R support or want to delve into 
how R works, getting the compiled version is always better. If you use RStudio, as 
suggested in the next paragraph, you must also download and install a copy of R. 


This book uses the Desktop version of RStudio (https: //www.rstudio.com/ 
products/rstudio/#Desktop) to make the task of working with R even easier. 
This product is a free download, and you can get it in Linux (Debian/Ubuntu, 
RedHat/CentOS, and SUSE Linux), Mac, and Windows versions. The book doesn’t 
use the advanced features found in the paid version of the product, nor will you 
require the RStudio Server features. 














You can try other R distributions if you find that you don’t like RStudio. The most 
common alternative distributions are StatET (http: //www.walware.de/goto/ 
statet), Red-R (http://decisionstats.com/2010/09/28/red-r-1-8-groovy-— 
gui/ or http: //www.red-r.org/), and Rattle (http: //rattle.togaware.com/). 
All of them are good products, but RStudio appears to have the strongest following 
and is the simplest product to use. You can read discussions about the various 
choices online at places such as https: //www.quora.com/What-—are—the—best— 
choices-—for-—an-R-IDE. If you use an alternative distribution, the screenshots in 
the book won’t match what you see onscreen, and the downloadable source code 
files may not load without error (but should still work with minor touchups). 


Installing R on Windows 


Windows doesn’t come with R installed. Before you can install RStudio, you must 
install a copy of R on your system. RStudio comes with a graphical installation 
application for Windows, so getting a good install means using a wizard, much as 
you would for any other installation. Of course, you need a copy of the installation 
file before you begin, and you can find the required download information in the 
“Choosing an R Distribution with Machine Learning in Mind” section, earlier in 
this chapter. The following procedure should work fine on any Windows system: 


1. Locate the downloaded copy of R on your system. 


The name of this file varies, but it normally appears as R-3.2.3-win.exe. The 
version number appears as part of the filename. In this case, the filename tells 
you that you've downloaded version 3.2.3, which is the version of R used for 
this book. Using a different version of R could cause problems with the 
downloadable source code. 
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2. Double-click the installation file. 


You see the Select Setup Language dialog box, shown in Figure 4-1. 


Select Setup Language 
Ga] Select the language to use during the 


FIGURE 4-1: 
Choose a (Engish 
language to use 
for the R 
installation script. 











3. Choosea language from the list and then click OK. 


You see the Setup dialog box, shown in Figure 4-2. The content of this dialog 
box differs with the version of R you install. However, the main consideration is 
that you install version 3.2.3, as shown in the figure. 


15! Setup - R for Windows 3.2.3 tela es) 


Welcome to the R for Windows 
3.2.3 Setup Wizard 
This will install R for Windows 3.2.3 on your computer. 


Itis recommended that you dose all other applications before 
continuing. 


Click Next to continue, or Cancel to exit Setup. 


FIGURE 4-2: 
Verify that you 
install the correct 
version of R. 





4. Click Next. 


You see a licensing information dialog box. 
5. Read the licensing information and then click Next. 


The wizard asks where you want to install R, as shown in Figure 4-3. The book 
assumes that you use the default installation location. If you use another 
location, you may need to modify procedures later in the book to work with 
your setup. 
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FIGURE 4-3: 
Choose an 
installation 
location for R. 


Select Destination Location 
Where should R for Windows 3.2.3 be installed? 


Setup will install R for Windows 3.2.3 into the following folder. 


To continue, click Next. If you would like to select a different folder, dick Browse. 


C:\Program Files\R\R-3.2.3] Browse... 





Atleast 1.2 MB of free disk space is required. 





6. Choose an installation location (if necessary) and click Next. 


FIGURE 4-4: 
Determine which 
components to 
install. 





The wizard asks which components you want to install, as shown in Figure 4-4. 
Given that the components don't consume a lot of space, installing everything 
is best to ensure that you have a complete installation. 


Select Components 
Which components should be installed? 


Select the components you want to install; dear the components you do not want to 
install. Click Next when you are ready to continue. 





User installation 





[F] Core Files 
[V] 32-bit Files 
[F] 64-bit Files 
[7] Message translations 


Current selection requires at least 150.3 MB of disk space. 


< Back 


Modify the list of selected components (if necessary) by selecting or 
deselecting the check boxes; then click Next. 


The wizard asks whether you want to customize the startup options, as shown 


in Figure 4-5. Modifying the startup options requires advanced knowledge of 
how R works. The default options will work fine for this book. 
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FIGURE 4-5: 

Use the default 
startup options 
unless you have 
advanced 
knowledge of R. 


FIGURE 4-6: 
Define the Start 
menu informa- 
tion for Ras 
needed. 





J 


8. 


Setup - R for Windows 3.2.3 

Startup options 
Do you want to customize the startup options? 
Please specify yes or no, then dick Next. 


Yes (customized startup) 
©) No (accept defaults) 





Select the No option and click Next. 


The wizard asks where you want to place the R icons (the “program shortcuts”) 
on the Start menu, as shown in Figure 4-6. This book assumes that you use the 
default Start menu setup. If you choose some other location, you may need to 
modify procedures later in the book. 





r] 





Setup - R for Windows 3.2.3 
Select Start Menu Folder 
Where should Setup place the program's shortcuts? R 


| = ' Setup will create the program's shortcuts in the following Start Menu folder. 


To continue, dick Next. If you would like to select a different folder, dick Browse. 


R Browse... 





[E] Don't create a Start Menu folder 


Taa Coa 


Choose Start Menu Configuration (if necessary) and then click Next. 


The wizard asks you to configure any additional script configuration options, as 
shown in Figure 4-7. Make certain that you keep the Registry Entries options 
selected, as shown in the figure. Otherwise, you'll encounter problems when 
configuring RStudio for use. 
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FIGURE 4-7: 
Modify the 
additional tasks 
as needed. 





18! Setup - R for Windows 3.2.3 Eleea 


Select Additional Tasks 
Which additional tasks should be performed? R 


Select the additional tasks you would like Setup to perform while installing R for 
Windows 3.2.3, then dick Next. 


Additional icons: 
[V] Create a desktop icon 
[E] Create a Quick Launch icon 
Registry entries: 
[V] Save version number in registry 
[V] Associate R with .RData files 


ae Gea 


10. Modify the selected additional tasks as needed and click Next. 


You see an Installing dialog box that tells you about the installation progress. 
After the process completes, you see a completion dialog box appear in its 
place. 


11. Click Finish. 


You're now ready to install RStudio so that you can use R with greater ease. 


12. Locate the downloaded copy of RStudio on your system. 


The name of this file varies, but normally it appears as RStudio-@.99. 491. 
exe. The version number is embedded as part of the filename. In this case, the 
filename refers to version 0.99.491, which is the version used for this book. If 
you use some other version, you may experience problems with the source 
code and need to make adjustments when working with it. 


1 3. Double-click the installation file. 


(You may see an Open File - Security Warning dialog box that asks whether 
you want to run this file. Click Run if you see this dialog box pop up.) You see 
an RStudio Setup dialog box similar to the one shown in Figure 4-8. The exact 
dialog box that you see depends on which version of the RStudio installation 
program you download. 
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FIGURE 4-8: 
The setup 


process begins by 
telling you that 
you're installing 
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RStudio. 


FIGURE 4-9: 
Specify an 
installation 
location for 
RStudio. 








=a) 


Welcome to the RStudio Setup 
Wizard 


This wizard will guide you through the installation of RStudio. 
It is recommended that you close all other applications 
before starting Setup. This will make it possible to update 


relevant system files without having to reboot your 
computer. 


Click Next to continue. 


14. Click Next. 


The wizard asks where to install RStudio on disk, as shown in Figure 4-9. The 
book assumes that you use the default location. If you choose some other 
location, you may have to modify some procedures later in the book to work 
with your setup. 


t> RStudio Setup 


Choose the folder in which to install RStudio. 


Setup will install RStudio in the following folder. To install in a different folder, dick Browse 
and select another folder. Click Next to continue. 


Destination Folder 





C:\Program Files \RStudio 


Space required: 407. 1MB 
Space available: 317.8GB 


1 5. Choose an installation location (if necessary) and then click Next. 


The wizard asks where you want to place the RStudio icons (shortcuts) in the 
Start menu, as shown in Figure 4-10. This book assumes that you use the 
default Start menu setup. If you choose some other location, you may need to 
modify procedures later in the book. 
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FIGURE 4-10: 
Define the Start 
menu informa- 
tion as needed. 


t> RStudio Setup bolo E 
J Choose Start Menu Folder 
Choose a Start Menu folder for the RStudio shortcuts. 


Select the Start Menu folder in which you would like to create the program's shortcuts. You 
can also enter a name to create a new folder. 





RStudio| 


Accessories 
Administrative Tools 
Anaconda2 (64-bit) 
AVG 

CodeBlocks 
Communication 
Free Editor 

Games 

Graphics 

Java 

Java Development Kit 


Do not create shortcuts 





16. Choose Start Menu Configuration (if necessary) and then click Install. 


You see an Installing dialog box that tells you about the installation progress. 
After the process completes, you see a completion dialog box appear in its 
place. 


17. Click Finish. 


You're ready to begin using RStudio. 


A WORD ABOUT THE SCREENSHOTS 


As you work your way through the book, you'll use an IDE of your choice to open the R 
files containing the book's source code. Every screenshot that contains IDE-specific 
information relies on RStudio because RStudio runs on all three platforms supported by 
the book. The use of RStudio doesn't imply that it's the best IDE or that the author is 
making any sort of recommendation for it — RStudio simply works well as a demonstra- 
tion product. 


When you work with RStudio, the name of the graphical (GUI) environment, the inter- 
face, is precisely the same across all three platforms, and you won't even see any signifi- 
cant difference in the presentation. The differences you do see are minor, and you 
should ignore them as you work through the book. With this in mind, the book does rely 
heavily on Windows 7 screenshots. When working on a Linux, Mac OS X, or other 
Windows version platform, you should expect to see some differences in presentation, 
but these differences shouldn't reduce your ability to work with the examples. 
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Installing R on Linux 
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TIP 


Some versions of Linux, such as Debian, come with R installed. If this is the case, 
and you want to use the version you have on your system, you don’t have to install 
a copy of R. The version of R used for this book is 3.2.3. If you have some other 
version of R installed, you may not get precisely the same results as described in 
the book. To determine whether you have R installed, use your platform’s search 
tool, such as Dash, to locate R. When you start R, you see the version information 
in the first line of Terminal. If you can’t find an R icon anywhere on the system, 
you need to install R. The directions vary significantly by the kind of Linux instal- 
lation you have. You can find the instructions for the various Linux installations 
at the following locations: 


>> Debian: https: //cran.r—-project.org/bin/linux/debian/ 
>» Red Hat: https: //cran.r—project.org/bin/linux/redhat/README 
>> OpenSUSE: https: //cran.r—-project.org/bin/1linux/suse/ 


>> Ubuntu: https: //cran.r—-project.org/bin/linux/ubuntu/ 


When working with OpenSUSE, you’re highly recommended to use the 1-click- 
install option. This option automates the task and makes installation significantly 
easier. The resulting installation works fine for this book. 


The Ubuntu installation requires that you make a change to the /etc/apt/ 
sources. list file. Make sure to create a backup for this file first by using the sudo 
cp /etc/apt/sources.list /etc/apt/sources.list.old command first. Use 
the sudo gedit /etc/apt/sources.list command to edit the file and add the 
required source information. 


After you ensure that you have R installed on your system, you can install a copy 
of RStudio Desktop. Make sure that you download the correct version of RStudio 
for your version of Linux. You can download either a 32-bit or 64-bit version of 
RStudio. Using the 64-bit version is highly recommended for improved perfor- 
mance. The following steps help you install RStudio: 


1. Locate the downloaded copy of RStudio on your system. 


The name of this file varies, but normally it appears as something like 
rstudio-@.99. 491-1386. deb (32-bit) or rstudio-@.99.491—amd64. deb 
(64-bit). The version number is embedded as part of the filename. In this case, 
the filename refers to version 0.99.491, which is the version used for this book. 
If you use some other version, you may experience problems with the down- 
loadable source code and need to make adjustments when working with it. 
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2. Open the installation file. 


Depending on your version of Linux, you see an installation utility, such as the 
Ubuntu Software Center when using Ubuntu. This window contains options for 
selecting the installation language and may contain other options as well. 


3. Click install (or a similar button, depending on your Linux version). 


You see an Installing or similar message. A progress bar shows the installation 
progress. Depending on your version of Linux, the installer may ask you to 
authenticate yourself by entering your password and pressing Enter. At some 
point, the message will change to Installed. 


4. Close the installation utility window. 
You're ready to begin using RStudio. 
Even after a successful installation, you may find that RStudio refuses to run. The 
Se, icon is there but nothing happens when you click it. The problem is likely one of 
not having a required dependency installed. In most cases, opening a copy of Ter- 
TECHNICAL minal, typing sudo rstudio, and pressing Enter will tell you the problem. The most 
common problem is missing the libgstapp-0.10.s0.0 library. These steps help you 
fix the problem: 


1. Type sudo apt-get install libgstreamer0.10-0 in Terminal and press Enter. 


Terminal may ask you to authenticate yourself. Afterward, sudo installs the 
required library for you. Simply follow the prompts as needed and answer yes 
when asked if you want to install the library and use hard drive space to 
support it. You must also install the plug-ins for this library. 


2. Type sudo apt-get install libgstreamer-plugins-base0.10-dev and press Enter. 


After a few moments, sudo installs the required plug-ins for you. Simply follow 
the prompts as needed and answer yes as needed for resource usage. 


3. Type sudo rstudio and press Enter. 
RStudio should now start as expected. 


You may need to continue using sudo to start RStudio with certain versions of 
Linux. The problem appears to be one of gaining the appropriate privileges. 


REMEMBER 


Installing R on Mac OS X 


The Mac OS X installation of R comes in two parts. The first is the installation of R. 
The current implementation doesn’t support older Mac platforms; in order to use 
R, you must have release 10.9 and above. Even though the installation instructions 
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WARNING 


TIP 


at https: //cran.r-project.org/bin/macosx/ discuss the use of older versions 
of R with older versions of the Mac, this book doesn’t support these older versions. 
Mac OS X versions 10.6 through 10.8 also rely on an older version of R, 3.2.1, which 
likely won’t work with the examples in this book. 


Take special note of the warning provided on the CRAN site. CRAN doesn’t have 
MAC OS X systems and doesn’t check the binary distributions for viruses. This 
means that you should take precautions such as scanning the file before you try to 
install it. Even though downloading a virus from the CRAN site is somewhat 
unlikely, taking precautions to keep your system safe still pays. The following 
steps help you install R on a Mac system: 


1. Locate the downloaded copy of the R package on your system. 


The name of this file can vary, but it normally appears as R-3.2.3.pkg. The 
filename contains the version number. This book assumes that you install 
version 3.2.3. If you use some other version, you may experience problems 
with the downloadable source code and need to make adjustments when 
working with it. 


2. Double-click the installation file. 


The installer starts. You see an installer window where you can choose options, 
such as the language to use. 


3. Click Install. 


The installer displays status information. You may see requests for additional 
information and may need to provide authentication information during the 
installation process. The installer eventually completes the installation process. 


4. Click Continue. 


You're ready to begin using R. 


Some Mac users prefer to use Homebrew (http: //brew.sh/) to install R on their 
system. This utility relies on command-line instructions that some people find 
easier to use. You can find a discussion of how to use Homebrew to install R ona 
Macathttps: //davidsimpson.me/2013/02/26/instal1ling-r—on-os—x/. Because 
this installation requires manual typing of commands and installing a third-party 
utility, using the installer whenever possible is a good idea. 


After you have R installed on your Mac, you can install RStudio. Follow these steps 
to perform the installation: 
1. Locate the downloaded copy of RStudio on your system. 


The name of this file varies, but normally it appears as RStudio-@.99. 491. 
dmg. The version number is embedded as part of the filename. In this case, the 
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filename refers to version 0.99.491, which is the version used for this book. If 
you use some other version, you may experience problems with the down- 
loadable source code and need to make adjustments when working with it. 


2. Drag the installation file to your Applications folder. 


The installer starts. You see an installer window where you can choose options, 
such as the language to use. 


3. Click Install. 


The installer displays status information. You may see requests for additional 
information and may need to provide authentication information during the 
installation process. The installer eventually completes the installation process. 


4. Click Continue. 


You're ready to begin using RStudio. 


Downloading the Datasets and 
Example Code 


This book is about using R and Python to perform machine science tasks. Of 
course, you can spend all your time creating the example code from scratch, 
debugging it, and only then discovering how it relates to machine learning, or you 
can take the easy way and download the prewritten code so that you can get right 
to work. Likewise, creating datasets large enough for machine learning purposes 
would take quite a while. Fortunately, you can access standardized, precreated 
data sets quite easily using features provided in some of the machine learning 
libraries. The following sections help you download and use the example code and 
datasets so that you can save time and get right to work with machine learning- 
specific tasks. 


Understanding the datasets 
used in this book 


Most of the datasets used in this book come directly from the R datasets package. 
You can find a listing of these datasets at https: //stat.ethz.ch/R-manual / 
R-devel/library/datasets/htm1/0@Index.htm1. If you want to see these data- 
sets in RStudio, simply type data() in a blank file, and the IDE shows it to you. 
Figure 4-11 shows typical output from typing data( ). 
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1 dataO ©} Global Environment 

|B AirPassengers i + | AirPassengers 
@ Bisales The classic Box & Jenkins airline data. Monthly totals of international 
@ BOD airline passengers, 1949 to 1960. 

FIGURE 4-11: @ co2 Seep Pan A 

The data() @ Chickweight 

function shows a @ DNase 
list of datasets @ EustockMarkets 
found in the @ Formaldehyde = Files Plots Packages Help Viewer 
datasets package. Q New Folder__® Delete (4) Rename 


In addition to these standard datasets, the book uses supplemental datasets. In 
many cases, these supplemental datasets are the same as those used for Python 
examples and let you compare the output of R to Python. These supplemental 
datasets are 


>> Boston: https: //stat.ethz.ch/R-manual/R-devel/library/MASS/ 
htm1/Boston.html 


>» Air Quality (airquality): https: //stat.ethz.ch/R-manual/R-devel/ 
library/datasets/html/airquality.html 


» Titanic: https: //stat.ethz.ch/R-manual /R-devel/library/datasets/ 
html /Titanic.html 


>» Iris: https: //stat.ethz.ch/R-manual /R-devel/library/datasets/ 
html/iris.html 


>> SMS Spam Collection (sms_data): http: //www.dt.fee.unicamp. 
br/~tiago/smsspamcollection/ 


In most cases, you download and use a special dataset by using a permutation of 
the data() function that includes the name of the dataset and the package in 
which it resides. For example, to use the Boston dataset, you use the data(Boston, 
package=“MASS” ) function call. The book explores the use of datasets as needed. 


Defining the code repository 


The code you create and use in this book will reside in a repository on your hard 
drive. Think of a repository as a kind of filing cabinet where you put your code. 
RStudio makes working with a repository easy. You essentially use the same 
structure that you always use to store files on your system. The following sections 
describe how to set up RStudio for use with this book. 


Defining the book's folder 


It pays to organize your files so that you can access them more easily later. This 
book’s files are in a folder called ML4D (Machine Learning For Dummies). Use 
these steps within RStudio to create a new folder. 
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FIGURE 4-12: 

Use the ML4D 
folder to hold the 
files you create in 
the book. 


1 e Click New Folder in the Files tab. 


You see a New Folder dialog box, where you can type the name of the folder 
you want to use. 


2. Type ML4D and click OK. 
RStudio creates a new folder named ML4D and places it in the folders list. 
3. Click the new ML4D entry in the list. 


RStudio changes the location to the ML4D folder where you perform tasks 
related to the exercises in this book, as shown in the lower-right corner of 








Figure 4-12. 
ORStudio o/s) 
File Edit Code View Plots Session Build Debug Tools Help 
oli 7- BB B) Project: (None) ~ 
ẹ@ ) Untitled1 mC Environment History anf} 
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1 f} Global Environment» 
Files Plots Packages Help Viewer 
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Type ‘“contributors()* tor more intormation and , 
‘citation()’ on how to cite R or R packages in 
publications. 


Type ‘demo()’ for some demos, 'help()' for on- 
line help, or 

"help.startQ’ for an HTML browser interface t 
o help. 

Type 'q()' to quit R. = 


> 





Creating a new file 


This book doesn’t do anything too fancy with R. In most cases, you create new 
files as needed to hold scripts. To create a new script file, choose Files New File>R 
Script or press Ctrl+Shift+N. When you need to create other kinds of files, the book 
gives instructions for doing so, but for now, all you need to know how to do is 
create script files. 


CHAPTER 4 Installing an R Distribution 61 


Saving a file 


After you create some code, you want to save it to disk. Choose File = Save or press 
Ctrl+S to perform this task. R files normally have a .r file extension. 


Removing a file 


In some cases, you might want to remove an old file. The file might contain mis- 
takes or simply old information. The easiest way to remove an old file is to select 
the check box next to its entry in the Files tab and click Delete. RStudio then 
removes the file from your drive for you. 


Opening a file 
To use the source code from this book, you must open the downloaded files. The 
following steps provide you with a quick method for changing the working folder 
location and then opening the files in RStudio. 
1. Click the ellipses (the three dots) next to Home in the Files tab. 
You see a file-browsing dialog box consistent with your platform. 
2. Select the folder you used to store the downloaded source code and click OK. 
RStudio displays the folder you selected in the Files tab. 
3. Choose More © Set As Working Directory in the Files tab. 
RStudio sets the downloaded source code folder as your working directory. 
4. Click any of the .r files that you see in the folder. 


You see the script loaded in RStudio. 
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IN THIS CHAPTER 


Discovering the basic R data types 
Using vectors in R 

Creating and using data lists 
Using matrices in R 


Defining multiple dimensions using 
arrays 


Working with data frames 


Using R to perform statistical tasks 


Coding in R Using 
RStudio 


efore you can work effectively with R, you need to know something about it. 

People have written entire books, dedicated websites, and created training 

courses on the subject of working with R and most likely haven’t exhausted 
the topic yet. This chapter serves as a simple overview of R. The purpose of this 
chapter is to give you enough information to understand the examples throughout 
the rest of the book. It’s designed around the needs of someone who already 
knows how to perform basic coding and just needs some information about R to 
get more out of the material that appears in the chapters that follow. 


One thing that this chapter does that you won’t find in other books, websites, and 
so on is a slant toward machine learning. You aren’t just getting an overview, but 
an overview that helps you understand R from a machine learning perspective. For 
this reason, you might want to scan the chapter even if you already know some- 
thing about R. This chapter doesn’t make you an R genius, but it does help prepare 
you for using R in a machine learning environment. 


CHAPTER 5 Coding in R Using RStudio 63 


Understanding the Basic Data Types 
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In contrast to most programming languages, R doesn’t require you to declare a 
variable of a specific type. R stores data in R-objects. These objects automatically 
assign a type to the data based on the manner used to store it and the data char- 
acteristics. You use the following R-objects to store data: 


» 
» 
» 
» 
» 
» 


Vectors 
Lists 
Matrices 
Arrays 
Factors 


Data frames 


R can support other R-object types, but these are the most common. This chapter 
discusses the R-object types used most often in the book. The easiest R-object to 
understand and use is the vector. A vector can contain the following basic data types: 


» 


» 


» 


» 


» 


Logical: Contains a Boolean value such as TRUE or FALSE that indicates a 
truth value. 


Numeric: Contains any number such as 12.3 or 45. A numeric value can 
include an integer (a whole number without a decimal point) or a real number 
(one that contains a decimal point). 


Integer: Holds an integer value — a number without a decimal point, such as 
2L or 34L. The number must not contain a decimal and you must append the 
letter L to the number to indicate that you want it to appear as an integer. 


Complex: Specifies a numeric value that contains both a real and imaginary 
value, such as 3 + 2i. 


Character: Defines a single character or a string, such as 'a', "Hello", 
"TRUE", or '12.3'.A character data type can use either single or double 
quotation marks as a delimiter. 


It’s also possible to store strings in raw format using the Raw data type. To create 
se, a Raw data type, you use the charToRaw( ) function. The resulting R-object con- 
tains the numeric values of each character in the original string. For example, the 

TECHNICAL following code: 


STUFF 


MyVar <- charToRaw('Hello!') 
print(MyVar) 
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produces the following output: 
[1] 48 65 6c 6c 6f 24 


You can determine the type of a variable using the class() function. For example, 
if you use theclass( ) function with the previous code, as inprint(class(MyVar) ), 
you get the following output: 


" 


[1] "raw 


OBTAINING ADDITIONAL INFORMATION 
ABOUT R 


When you start delving into R, you might find that you absolutely can't go on without 
some additional information. Fortunately, the R community is prepared to help you in 
all sorts of ways. You may want to visit these sites after reading this chapter if you learn 
best simply by reading material (each of these sites documents R in some way): 


© https: //www.r-project.org/other-docs .html 
e https: //cran.r—project.org/manuals.htm1 


e http: //www.rdocumentation.org/ 


Some people learn better with tutorials. You can find many tutorials of various types 
online. In many cases, you use them to discover specific techniques by viewing practical 
examples and then practicing the technique yourself. These sites offer various kinds of 
tutorials that may help you in your journey to become more experienced with R: 

e http://www. tutorialspoint.com/r/ 

© http: //www.r-tutor .com/ 

e https: //www.rstudio.com/resources/training/online-learning/ 
Another way to work with R is to view videos. More and more people are turning to 
videos as a means for learning new techniques without actually reading anything or 
participating in a tutorial environment. Seeing someone else perform the task does 
have advantages in that you can focus on the technique rather than how to implement 
it. Here are a few places to find videos (some of them are multipart — the links point to 
Part 1) that can help you gain experience with R: 

e https: //www. youtube. com/watch ?v=TGo9F @QyBuE 

® https: //www. youtube. com/watch?v=ZmtkqaRVTDc 


e https: //www. youtube. com/watch ?v=HK jSKtVV6GU 
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Working with Vectors 


REMEMBER 


Vectors can store single values, as described in the "Understanding the Basic Data 
Types” section of the chapter, or they can store a group of values. When working 
with a group of values, you use the c( ) function to combine them. All the mem- 
bers of a vector are the same type. Consequently, when you create a vector using 
code such as MyNumbers <- c(1, 2, 3), you get a vector of numbers. The 
print(MyNumbers ) function shows three distinct values: 


[4] 123 
When you use the class(MyNumbers) function, you see an output of: 
[1] "numeric" 
All the values in a vector must be the same type. Consequently, when you mix types 
in a vector, such as by calling MyVector <- c(1, "Hello", TRUE), you still get 
three distinct values. However, the print(MyVector ) function shows this output: 
[1] a "Hello" "TRUE" 
Calling class(MyVector ) tells you that all the entries are now of type character. 
When working with vectors, R automatically performs any data type conversions 


required to maintain the integrity of the R-object. Consequently, you can encoun- 
ter unwanted changes when using the wrong R-object type. 
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Unlike vectors, lists can contain multiple data types. In fact, lists can contain 
vectors, other lists, and functions (a topic discussed in the “Working with func- 
tions” section, later in the chapter). Consequently, lists can be more versatile than 
vectors in storing data precisely as you intend to store it. 


To create a list, you must use the list() function. Otherwise, R assumes that you 
want to create a vector. For example, MyList <- list(1, "Hello", TRUE) creates 
a list with three separate entries of the type you expect. Using the print(MyList) 
function displays the following output: 


[[4]] 
[1] 4 
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[[2]] 
[1] "Hello" 


[[3]] 
[1] TRUE 


Each entry is separate. Consequently, when you call class(MyList), the output is 
a list. 


Every entry is also a list. To test that each entry is a list, you need to know about 
indexing. An index provides you with access to a particular list member. For 
example, call print(MyList[1]) and you see the first member of the MyList list: 


Eai 
[4] 4 


However, what you’re seeing is a list containing the actual data that you want to 
access. When you use class(MyList[1]) function, you still see an output of a list. 
To actually access the inner data member, you must access the element within 
remember the list. Use the print(MyList[[1]]) function to see the value of 1 and 
class(MyList[[1]]) function to see a data type of numeric. 


A list can contain vectors. To create a vector within a list, you use the c( ) function. 
For example, MyVectorList <- list(c(1, 2, 3)) creates a list that contains a 
vector. The print(MyVectorList) function shows the following output: 


[[4]] 
iaj ai ae 


To access a particular member of the vector, you again use embedded indexes. For 
example, if you want to print the second member of the vector, you use the 
print(MyVectorList[[1]][2]) function. The output is as follows: 


[1] 2 


Working with Matrices 


Vectors and lists are one-dimensional R-objects. However, you often need two 
dimensions to express data accurately. A matrix provides a two-dimensional 
method for expressing data. You can create matrices containing any sort of data, 
but you normally find them used to hold numeric data. The following sections give 
you an overview of using matrixes. 
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Creating a basic matrix 


The matrix represents a change in R complexity. As with the list, you use a special 
function, the matrix() function, to create a matrix. However, in contrast to the 
list, a matrix() function call can contain more than just data. In fact, a matrix() 
function can include these elements: matrix(data, nrow, ncol, byrow, dimnames). 
Each of the elements is an argument to the function that helps R create precisely 
the matrix you want. Here is a quick description of each of these arguments: 


>> data: Defines the data used to fill the data elements in the matrix. The data 
normally appears as part of a vector. 


>> nrow: Specifies the number of rows to create. 
>> ncol: Specifies the number of columns to create. 


>> byrow: Defines the arrangement of the vector elements within the matrix. 
When set to TRUE, R arranges the vector elements by row. 


>> dimnames: Determines the names assigned to the rows and columns. 


When creating a basic matrix, you can create an empty matrix by calling matrix() 
without any arguments, but such a matrix isn’t particularly useful. Normally you 
provide input dataasaminimum. Forexample, callingMyMatrix <- matrix(c(1:5)) 
creates a matrix that contains the values 1 through 5 arranged in a single column. 
Note that this example uses a new form of the c() function that specifies a range 
of values, 1 through 5, by separating the start of the range from the end of the 
range with a colon. The print (MyMatrix) function call shows that the output from 
this matrix is 


IL al 


oF WON FO 


If you want your matrix to contain more than one column, you must specify the num- 
ber of rows as a minimum. For example, calling MyMatrix <- matrix(c(1:6), 3) 
creates a matrix with three rows. The print(MyMatrix) function call shows the 
following output: 


[,4] [,2 
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Specifying the number of columns defines how many columns you see as well. In 
this case, when the input data doesn’t contain enough information to fill the 
matrix, R simply repeats the data until the matrix is full. For example, calling 
MyMatrix <- matrix(c(1:6), 3, 4) creates a matrix with three rows and four 
columns. However, there isn’t enough data to fill the matrix. Consequently, when 
you call the print (MyMatrix) function, you see the data duplicated as shown here. 


Loe) Well ell TA 
[1,] 1 4 4 4 
[2,] 5 2 IS 
[21 3 6 3 6 


Changing the vector arrangement 


When you use the basic matrix construction techniques, you get a matrix that fills 
the elements in column order. For example, when you call MyMatrix <- 
matrix(c(1:8), 2, 4), you get an output of 


Unfortunately, you might not always want to fill the matrix in column order. 
That’s where the byrow argument comes into play. When you set this argument to 
TRUE, the order used to fill the matrix changes to a row order. Calling MyMatrix 
<- matrix(c(1:8), 2, 4, TRUE) produces an output of 


Accessing individual elements 


As with lists, you use indexing to access the individual elements of a matrix. How- 
ever, now you have two dimensions to deal with. Consequently, in order to access an 
individual element, you normally specify both a row and a column index. For exam- 
ple, when using the matrix created by the MyMatrix <- matrix(c(1:8), 2, 4, 
TRUE ) function call, a call to print (MyMatrix[2,2]) produces an output of [1] 6. 


You always specify the row first, and then the column. For example, 
print(MyMatrix[1, 2]) produces an output of [1] 2, but print(MyMatrix[2, 
1] ) produces an output of [1] 5. Specifying the row and column in the wrong 
order causes problems with your scripts. 
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You can also access multiple elements by using ranges. For example, the 
print(MyMatrix[1:2, 2]) function call produces an output of 


[aj] 2 


You can also use ranges for the column. For example, the print(MyMatrix[1:2, 
2:3] ) function call produces an output of 


[,4] [,2] 
[1,] 2 3 
[2,] 6 7 


Naming the rows and columns 


In some cases, naming the rows and columns makes the data more meaningful. 
Humans don’t do exceptionally well with numerically named rows and columns 
because we forget what the numbers signify. To name the rows and columns, you 
must provide a vector containing the names. The row names appear first, followed 
by the column names. Here is an example that shows both rows and columns 
named: 


RN = c("Row1", "Row2") 
CN = c("Col4", "Col2", "Col3", "Col4") 
MyMatrix <- matrix(c(1:8), nrow=2, dimnames=list(RN, CN)) 


You should notice something new about the matrix() function. Until now, the 
function relied on positional arguments — those made in order and by position. 
When using positional arguments, you must include every argument, and the 
argument must appear in the proper position. This example uses named arguments, 
in which the name of the argument appears with an equals sign, followed by the 
argument value. When using named arguments, you don’t have to provide every 
argument, and the arguments don’t have to appear in order. The output from this 
example is 


Colt Col2 Col3 Col4 
Row1 1 3 5 T 
Row2 2 4 6 8 


Using named rows and columns makes element access clearer. For example, to see 
the content of Row2, Col2, you use the print(MyMatrix["Row2", "Col2"]) func- 
tion call. The output is [1] 4. 


Now that you have row and column names, you need to know how to access them 
outside of creating a new matrix. The rownames() function works with the row 


PART 2 Preparing Your Learning Tools 


TIP 


names and the colnames( ) function works with the column names. For example, 
to retrieve the current set of column names, you use the colnames(MyMatrix) 
function call. To set the column names to a different value, you simply set 
the vector using the colnames() function. Using the colnames(MyMatrix) 
<- e("Columni", "Column2", "Column3", "Column4") function call changes 
the name of the columns so that the output now appears as 


Columnt Column2 Column3 Column4 
Rowt 1 3 5 T 
Row2 2 4 6 8 


You may also decide that you no longer want either the rows or the columns 
named. The built-in value NULL means nothing. It’s what you use when you don’t 
want an element to have any value at all. Using the rownames(MyMatrix) <- NULL 
command sets the row names to nothing. Consequently, the output changes to 


Columnt Column2 Column3 Column4 
[ay] al g 5 7 
[2,] 2 4 6 8 


Interacting with Multiple Dimensions 
Using Arrays 





REMEMBER 


Data doesn’t always fall neatly into one or two dimensions. Sometimes you need 
three or more dimensions to express a dataset adequately. In this case, you can 
use an array to hold the information. An array can contain multiple dimensions of 
data to express complex relationships. Even though this chapter works with 
three-dimensional arrays, arrays can contain any number of dimensions. 


An array uses many of the functions that a matrix uses to perform tasks. In fact, 
an array is a structure of matrices organized in a manner that allows the use of 
multiple dimensions. The following sections provide you with an overview of 
arrays and present some simple ways to use them. 


Creating a basic array 


You use the array( ) function to create arrays. The array syntax of array(data, 
dim, dimnames) provides the means to create reasonably complex arrays. 
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>> data: Defines the data used to fill the data elements in the matrix. The data 
normally appears as part of a vector. 


>» dim: Specifies the number of rows to create. 


>> dimnames: Determines the names assigned to the rows and columns. 


As with a matrix, you can create a blank array and add information to it later. 
In most cases, even a simple array will contain data. For example, MyArray 


<- array(c(1:8)) creates a one-dimensional array with values from 1 through 8 
that looks and acts much like a vector. 


A more practical array contains at least two dimensions. You can specify the 
dimensions at the time you create the array, as you do when creating a matrix, or 
you can re-dimension the array later. For example, dim(MyArray) <- c(4,2) 
changes the previous array from a one-dimensional array to a two-dimensional 


array that looks much like a matrix. After the change, you see the following 
output: 


Bon ee 
ONO oO 


Of course, you can re-dimension the same array into a three-dimensional array 


by calling dim(MyArray) <- c(2,2,2). The output now consists of two matrixes, 
like this: 


Naming the rows and columns 


As when working with a matrix, you can name the elements in an array. Using 
names can make working with the array easier. For example, you can use the 
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MyArray <- array(c(1:8), c¢(2,2,2), list(c("Row1","Row2"),c("Col1i", 
"Col2"),c("Mat1","Mat2"))) function call to create an array with named 
elements. Notice the order: row names, column names, and matrix names. You 
can also set the names individually, just as you can when working with a matrix. 
Display the array using print (MyArray ). The output of this example is 


, , Mat 


Col1 Col2 
Row1 1 3 
Row2 2 4 


, , Mat2 


Col1 Col2 
Rowt 5 y 
Row2 6 8 


To access a specific array element, you specify its index much as you do with a 
matrix, but you need to provide information for each dimension. For example, 
print(MyArray["Row1", "Col2", "Mat2"]) displays an output of [1] 7. You can 
also provide a partial index or use numbers in place of names. For example, 
print(MyArray[,,2]) displays the following output: 


Coli Col2 
Rowt 5 T 
Row2 6 8 


Note that even though you’ve named the elements, you can still use numbers to 
access them. Arrays support renaming using the same techniques as a matrix. 
There isn’t a matnames( ) function, though. You have access to rownames() and 
colnames() as before, but to rename a matrix, you use the generic dimnames( ) 
function. The dimnames() function works with any of the dimensions, no matter 
how many there are, in an indexed manner. In this case, the matrix names appear 
as the third index, so you use dimnames(MyArray)[[3]] <- c("Matrixt", 
"Matrix2") to rename the matrixes. The output from this change is 


, , Matrixi 
Col1 Col2 


Row1 al 3 
Row2 2 4 
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, , Matrix2 


Col1 Col2 
Rowt 5 7 
Row2 6 8 


Creating a Data Frame 
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REMEMBER 


Data frames are tabular structures. In other words, they’re akin to a matrix or a 
two-dimensional array in that they have rows and columns. However, data frames 
work more like a database. Each column represents a variable and each row a 
record. For example, when describing a person, you might include name and height. 
The names and heights represent two columns — two variables used to describe a 
person. A combination of an individual name and height is a row — a single record 
within the data frame. The following sections discuss data frames in more detail. 


Understanding factors 


A factor is a special structure that enable you to categorize data. You can create 
them using either numbers (with or without a decimal point) or strings. What 
makes factors special is that you see unique values only after creating a factor. 
Each of these values is a level. To create a factor, you use the factor() function. 
For example, calling MyFactor <- factor(c("North", "South", "East", 
"West", "West")) creates a factor containing strings. The output of 
print(MyFactor ) is 


[1] North South East West West 
Levels: East North South West 


Even though the factor has five entries, it has only four levels because the West 
value is a repeat. The factor tracks individual data entries, but the levels contain 
only unique values. Use the levels() function, such as levels(MyFactor), to 
obtain just the levels. If you want to count the levels instead, you use the nlevels() 
function, such as nlevels(MyFactor ), which produces an output of [1] 4. 


You can also change the order of the levels by modifying the levels entry witha 
vector of new entries, such as levels(MyFactor) <- c("North", "South", 
"East", "West"). However, change the entries with caution, because R also 
renames the individual values. The new output of print(MyFactor ) is 


[1] South East North West West 
Levels: North South East West 
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In this case, R renames all North values to South because South now appears 
where North originally appeared. Likewise, all East entries now appear as North 
entries. In fact, the only entry that hasn’t changed is West. The best way to change 
factor level order is to place the entries in a new variable. The following code shows 
an example. Calling SecondFactor <- factor(MyFactor, levels=c("North", 
"South", "East", "West")) produces an output of 


[1] North South East West West 
Levels: North South East West 


The entries are still the same, but the order of the levels has changed. When mak- 
ing changes to original data, make sure that you test any changes to determine 
whether they have unforeseen consequences. If you have doubts as to whether a 
change will work properly, place the modified data into a new variable. 


Understand that R provides for all sorts of data manipulations, and one manipula- 
tion can affect factors in particular. Say that you create a numeric factor, such as 
NumFactor <- factor(c(1, 2, 3, 4, 1, 2, 1, 3)). When you print NumFactor, 
you get the expected result: 


IW] d 2Saa aa es 
Levels: 1234 


The numeric values might have seemed like a good idea at first, but now you find 
that string labels would work better. Use the labels argument to make the change 
and place the values in a new variable like this: StrFactor <- factor(NumFactor , 
labels=c( "North", "South", "East", "West")). R converts the numeric data 
to string data and places it in StrFactor so that the result appears like this: 


[1] North South East West North South North East 
Levels: North South East West 


So far, you’ve seen how to count and list the levels, but sometimes you need 
to know how many times each entry actually appears in the factor. In this 
case, you rely on the table() function. For example, using Appearances <- 
table(StrFactor) tallies the number of times each string appears in the 
StrFactor factor. Printing the output shows the following result: 


StrFactor 
North South East West 
3 2 2 1 


In this case, North appears three times in the factor, but West appears only once. 
You may want to know how many times just one of the values appears. In this 
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case, you can rely on an index, such as Appearances["North"], which produces 
an output of 


North 
3 


Creating a basic data frame 


Data frames provide the means to mix data types within a single structure. When 
working with R, you can use columns of the numeric, character (strings), and fac- 
tor data types. Each of the columns contains the same number of entries and form 
rows when put together. To give you some idea of how this all works, you could 
create some columns using the following code: 


Names <- c("Jane", "Sam", "Jose", "Amy" ) 
Ages <- c(42, 33, 39, 25) 
Locations <- factor(c("West", "West", "South", "South") ) 


Note that the columns are all the same length and each will contribute toward the 
records in the data frame. To create the data frame, you use the data. frame() 
function. Be sure to include the period between data and frame. 


You can assign a name to each column in the data frame simply by including it as 
part of the call. To create the example data frame you call EmpData <- data. 
frame(Name=Names, Age=Ages, Location=Locations ). When you print EmpData, 
you get this output: 


Name Age Location 
1 Jane 42 West 
2 Sam 33 West 
3 Jose 39 South 
4 Amy 25 South 


As you can see, the data entries now appear in a tabular format, and you can 
access the information by either row or column. The access technique is a little 
odd. You can use an index just as you normally do, but the indexing method is dif- 
ferent from other R-objects. To access the first column, you call EmpData[1] to 
obtain the following output: 


Name 
1 Jane 
2 Sam 
3 Jose 
4 Amy 
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To access the first row, you must append a comma to the index number. This 
means that you call EmpData[1,] to see the following output: 


Name Age Location 
1 Jane 42 West 


When working with columns, you can also use a string index. For example, 
EmpData["Age"] accesses the ages in the Age column. If you want to access a 
particular row and column, you use the same indexing approach of row, and then 
column, value as you normally do. For example, EmpData[1, 2] provides an out- 
put of [1] 42. 


A final method of access is to interact with the vectors containing the column 
values directly. This method relies on the $ operator. For example, to access the 
Age vector, you call EmpData$Age, which produces the following: 


[1] 42 33 39 25 


Normally, R converts all strings to factors. Using this approach enables you to use 
your string data as a method for categorizing the data. However, you might not 
want to see the string data as factors. In this case, you set the stringsAsFactors 
argument to FALSE. For example, calling EmpData <- data. frame(Name=Names, 
Age=Ages, Location=Locations, stringsAsFactors=FALSE) would create the 
Name column as the character type rather than as a factor. 


Interacting with data frames 


As the book progresses, you see all sorts of ways to interact with data frames. 
However, you need to know a few basic interactions before getting into some of 
the material that follows. The following sections describe these basic interactions 
using the EmpData data frame created in the previous section of the chapter. 


Querying the data frame structure 


At times, you need to know the data structure of a data frame. The str() function 
helps you perform this task. For example, str(EmpData) produces the following 
output (assuming that you set stringsAsFactors to FALSE): 


"data. frame': 4 obs. of 3 variables: 
$ Name : chr "Jane" "Sam" "Jose" "Amy" 
$ Age : mum 42 33 39 25 


$ Location: Factor w/ 2 levels "South","West": 22114 
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By knowing the structure of the data frame, you can determine how best to work 
with the data. For example, you can use Location to help categorize and organize 
the data in this case. 


Summarizing data frame data 


Obtaining a statistical summary of the data in a data frame can often provide you 
with ideas on how you might perform some types of analysis or whether the data 
frame requires additional manipulation before it becomes suitable for machine 
learning. The summary( ) function helps you perform this task. For example, if you 
call summary(EmpData), you obtain the following output, which tells you about 
each of the columns in the data frame. 


Name Age Location 
Length: 4 Min. 1:25.00 South: 2 
Class :character Ast Qu.:31.00 West :2 
Mode :character Median :36.00 

Mean 734.79 
Srd Qu. :39.75 
Max. 142.00 


Extracting the data in a data frame 


The data frame you have might contain more information than you really need 
for your analysis (or perhaps you want to split the data frame into training and 
testing parts). The technique you use to extract data depends on the result you 
want to obtain. For example, you might not need the Location column for your 
analysis. You can create a subset of the data frame by extracting just the Name 
and Age columns by calling SubFrame <- data. frame(Name=EmpData$Name, 
Age=EmpData$Age ). The resulting data frame looks like this: 


Name Age 
1 Jane 42 
2 Sam 33 
3 Jose 39 
4 Amy 25 


Alternatively, you might need specific rows. In this case, you can use indexing. For 
example, you could create a data frame containing just rows 2 and 3 by calling 
SubFrame <- EmpData[2:3,].In this case, you see the following output: 


Name Age Location 
2 Sam 33 West 
3 Jose 39 South 
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Expanding a data frame 


In some cases, you need to expand a data frame to include more information. For 
example, you may find that you need to combine data sources to create a data set 
that’s suitable for analysis, which means adding columns. You may also obtain 
additional information from another source, which means adding rows. The 
following sections describe how to perform these two tasks. 


Adding a column 


When adding a column, you provide enough entries to add the new variable to all 
the existing records. The data appears as part of a vector. For example, you might 
need to add a hire date to the EmpData data frame used for previous examples. The 
following code shows how to perform this task: 


HireDates <- as.Date(c("2001/10/15", "2012/05/30", 
"2010/06/28", "2014/04/02")) 
EmpData$HireDate <- HireDates 


This example shows how to create a vector of dates. The as .Date() function lets 
you tell R to treat the strings as dates rather than strings. The example uses the $ 
operator to define a new column, HireDate, and assign the HireDates vector to it. 
The new version of EmpData looks like this: 


Name Age Location HireDate 
1 Jane 42 West 2001-10-15 
2 Sam 33 West 2012-05-30 
3 Jose 39 South 2010-06-28 
4 Amy 25 South 2014-04-02 


Note that the code for creating HireDates appears on two lines. You can use mul- 
tiple lines in R to make the code more readable. You don’t have to indent the sec- 
ond and subsequent lines, but most developers do so by convention to make the 
code easier to understand. The computer doesn’t need the indentation, but human 
viewers do in order to see which lines belong together as a single call. 


Adding a row 


When adding a row, you create a new record that contains all the columns found 
in the existing rows. The data appears as part of a data frame, so you use the same 
technique that you use with any other data frame to create it. To add the new row 
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to the existing data frame, you use the rbind() function, as shown in the follow- 
ing example: 


NewEmp <- data. frame( 


Name = "Kerry", 
Age = 51, 
Location = "West", 


HireDate = as.Date("2016/06/28"), 
stringsAsFactors = FALSE) 


EmpData <- rbind(EmpData, NewEmp) 


The new record, NewEmp, contains all the columns currently found in EmpData 
(assuming that you’ve made the other changes listed in the chapter). When using 
the rbind() function, put the data frames in the order in which you want to see 
them in the result. With the addition of this row, EmpData now looks like this: 


Name Age Location HireDate 
1 Jane 42 West 2001-10-15 
2 Sam 33 West 2012-05-30 
3 Jose 39 South 2010-06-28 
4 Amy 25 South 2014-04-02 
5 Kerry 51 West 2016-06-28 


Performing Basic Statistical Tasks 


So far, the chapter demonstrates various data structures and techniques for work- 
ing with them. However, just creating a data structure isn’t enough in many cases. 
The following sections provide an overview of some basic tasks you can use to 
perform statistical analysis. These sections help you understand some of the 
examples later in the book. 


Making decisions 


To perform useful work, computer languages provide the means for making deci- 
sions. R provides three decision-making structures: if, if...else, and switch. 
Use the if forms when you make decisions with a limited number of outcomes. The 
switch form works best when you make a decision with multiple outcomes. The 
following sections describe these decision-making structures in more detail. 
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Using the if statement 


The if statement lets you make a simple decision. When something is true, R 
executes the code in the associated code block. Here is an example of the if state- 
ment in action: 


A <- 3 

B <- 4 

E ARB) 
{ 


print("Less than!") 


The code begins with if, as you might expect. The part within the parentheses is 
an expression and must evaluate to TRUE or FALSE. The two variables, A and B, 
appear between the < (less than) operator. R supports all the usual operators, 
including <, <=, ==, !=, >=, and >. You can find a full list of the R operators at 
https: //stat.ethz.ch/R-manual /R-devel/library/base/html/Syntax.html. 


Placing all code blocks within curly brackets is good practice, as shown in the 
example. This particular example would have worked without the curly brackets, 
but not using the curly brackets leads to errors and makes the code harder to read. 
The output of this example is [1] "Less than!". 


Using the if... else statement 


The if. ..else statement works similarly to the if statement, except that it pro- 
vides a response when the expression isn’t TRUE. When the expression evaluates 
to FALSE, the else portion of the if. . .else statement executes. Here is an exam- 
ple of the if. ..else statement at work. 


Myserings <= e( ims", Mie", "a", “siciedine).™)) 


if ("This" %in% MyStrings) 


{ 
print("Found!") 
} else 
{ 
Print("Not Found.") 
} 


This example uses a special operator, 4in%. These special operators perform unusual 
comparisons not commonly found in other languages. In this case, the 4in% opera- 
tor determines whether "This" appears somewhere within the MyStrings vector. 
Later chapters show how to use other special operators. The output for this example 
is [4] "Found!". 
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Using the switch statement 


A switch statement can provide one of several outputs, depending on the input 
value. There are several forms of this statement, but the most common reacts to 
numeric input, as shown in the following example: 


Num <- as.integer(read1ine( 
"Enter a number between 1 and 5: ")) 


Result <- switch( 
Num, 
"One", 
"a", 
"It's Three!", 
"Almost There!", 
"Done!" ) 


print(Result) 


The code begins by asking the user to provide an input value. Even though you 
don’t use this feature very often in the book, knowing how to get input from the 
keyboard is helpful when you experiment. R provides input as a string, unless you 
convert the value to some other type, such as using the as. integer() function 
shown in this example. 


The variable Num now contains a number from 1 through 5. The switch() function 
creates a variable to hold the evaluation and output of Num. For example, when a 
user types 1, Result outputs "One". 


Working with loops 


Loop statements help you perform tasks more than once. R supports three kinds 
of loop statements: repeat, while, and for. Each loop statement has certain 
properties. For example, the for loop always executes a specific number of times. 


Using the repeat loop 


The repeat loop continues to execute until you break out of it. The other two 
statements optionally support the break statement, but when working with the 
repeat loop, you must break out of it by checking for a condition. The following 
example shows the repeat loop in action. 
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Count <- 1 


repeat 
{ 
print(Count) 
if (Count > 5) 
{ 
break 
} 


Count <- Count + 1 


The example begins by creating a counter variable, Count. When using repeat, 
you must provide some means of ending the loop. The code then prints the current 
value of Count, determines whether Count is greater than 5, and then updates 
Count to the next value. This last step is essential or the loop will never end. Here 
is the output from this example. The repeat loop always executes at least once. 


SSS es es es 
Sa 
oor WN KE 


Using the while loop 


The while loop checks for an ending condition before it starts the loop, which 
means that it may not execute even once. The basic principles are the same as 
when using the repeat loop in that you must provide some means for ending the 
loop, which usually means performing updates on a variable within the loop’s 
code block. Here is an example of the while loop in action: 


Count <- 1 


while (Count <= 6) 

{ 
print(Count) 
Count <- Count + 1 


This example has the same output as the repeat loop but uses less code. A while 
loop can perform tasks more efficiently than a repeat loop under the right condi- 
tions. However, you must account for the fact that the loop might not ever 
execute. 
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Using the for loop 


A for loop executes code a specific number of times. You usually use it to perform 
tasks with structured data. For example, you may need to process the elements of 
a vector individually. Here is an example of the for loop in action: 
iVSarings <= Gums”, “is”, "a", “siticiine).”)) 
for (AString in MyStrings) 
{ 
print(AString) 


In this case, the for loop places the values in the vector, MyStrings, into the 
string, AString, one at a time. When you execute this example, you see the fol- 
lowing output: 


"This" 


" " 


is 


a 


[4 
[4 
[4 
[4] "string." 


m e a a 


Performing looped tasks without loops 


In some cases, you can perform tasks with R that would normally require a loop in 
other languages, but without actually using a loop. For example, say that you cre- 
ate a vector: 


IntValues <- c(1, 4, 5, 9, 2) 
The vector contains the integers shown in the list. However, now you need to turn 
them into a decimal value by multiplying each value by 0.1. When working with 
other languages, you need to create a loop, retrieve each value individually, per- 
form the multiplication, and store the value in a new variable. When working with 
R, you can simplify the task as shown here: 


DecValues <- IntValues x 0.1 


Note that no loop is involved. You simply multiply the vector by 0.1. When you 
display the values in DecValues, you see the following output: 


[1] 0.1 0.40.50.90.2 
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Working with functions 


Functions provide the means for packing code that you intend to use more than 
once. You can package the code and turn it into a black box, which is essentially an 
entity to which you provide inputs and expect certain outputs; however, you don’t 
have to worry about how it (the black box) works. Using a black box reduces pro- 
gram complexity. To create a function, you use the function() function, followed 
by a block of code you want to execute. A function can receive inputs through 
arguments and provide a return value based on the task it performs. The following 
code shows a basic function that accepts arguments and provides a return value: 


LessThan <- function(Value1, Value2) 


{ 
if(Value1 < Value2) 
{ 
result <- TRUE 
} 
else 
{ 
result <- FALSE 
} 
} 


In this case, LessThan() requires two arguments as input. The function compares 
the first argument to the second argument. When the first argument is less than 
the second argument, the function returns TRUE; otherwise, it returns FALSE. 
Making a call, such as print(LessThan(1, 2)), outputs [1] TRUE. 


You might think that LessThan( ) works only with numbers. However, you can use 
it to compare all sorts of data. For example, when you call print(LessThan("G", 
"H")), you get an output of [1] TRUE. However, if you reverse the two arguments, 
you get an output of [1] FALSE instead. You can compare dates and all sorts of 
other data types using this simple function. Of course, the real-world functions 
you create will be more complex, but this simple example gives you an idea of 
what’s possible and how you can potentially extend the functionality of functions 
you create. 


Finding mean and median 


R provides built-in functions to find mean and median. The mean( ) function has 
the following syntax: 


mean(x, trim=0, na.rm=FALSE) 
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where 


>> x: Contains the input vector 


>> trim: Determines the number of observations to drop from both ends of the 
sorted vector 


>> na.rm: Specifies whether to remove the missing values from the input vector 


If you create a vector of samples, such as MySamples <- c(19, 4, 5, 7, 29, 19, 
29, 13, 25, 19, 42), the output of mean(MySamples) is [1] 19.18182. 


Finding the median is just as easy. In this case, you use the median( ) function that 
has the following syntax: 


median(x, na.rm=FALSE) 


Using the same vector of samples and calling median(MySamples) produces an 
output of [1] 19. 


R doesn’t provide a built-in function for finding mode. You must instead create a 
function of your own design or use one of the external packages available on the 
market. The external packages represent the least amount of work. To find mode, 
you must install and use such an external package. The following code shows how 
to find the mode using the modeest 2.1 package: 


install .packages("modeest" ) 
library(modeest ) 


MySamples <- c(19, 4, 5, 7, 29, 19, 29, 13, 25, 19, 42) 
mlv(MySamples, method = "mfv") 


When you call install .packages(), you might see a warning message that R 
can’t write the package information to the default directory. The install. 
packages() function then suggests creating a personal library, which is the route 
you should take. The install.packages() function then asks you to choose a 
mirror for downloading the package. Choose the mirror that is closest to your 
location. 


Simply installing the package doesn’t let you use it. You must use the library() 
function to read the package into memory. 


The mlv() function provides an estimation of mode. The library provides forms of 
mlv() for use in plotting and printing as well. To see all these other options, use 
the library(help = "modeest") function call. In this case, mlv() uses the most 
frequent value (mfv) method to calculate mode. The output from this example is 
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FIGURE 5-1: 
A simple example 
of a bar chart. 


Mode (most likely value): 19 
Bickel's modal skewness: @ 
Call: mlv.default(x = MySamples, method = "mfv") 


Charting your data 


R gives you access to all the typical charts and plots: pie, bar, box, histogram, line, 
and scatter. Add-on libraries provide you with even more options. Many of the 
examples in the book show graphic output, so this section is relatively short. In 
this case, you begin by creating a vector with sample data, MySamples <- c(19, 
4, 5, 7, 29, 19, 29, 18, 25, 19, 42). Calling barplot(MySamples) creates 
the output shown in Figure 5-1. 
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Of course, this is the most basic output you can create. R provides you with a num- 
ber of methods for modifying the output and making it look the way you want it 
to look. The basic barplot() function syntax is as follows: 


barplot(H,xlab,ylab,main, names.arg,col ) 
where 


>» H: Supplies a vector or matrix containing numeric values used in the bar chart 
>> xlab: Defines a label for the x-axis 

>» ylab: Defines a label for the y-axis 

>> main: Specifies a title for the bar chart 

>> names.arg: Supplies a vector of names that appear under each bar 


22 col: Contains a list of colors to use to color the bars in the graph 
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IN THIS CHAPTER 


Determining which Python 
distribution to use for machine 
learning 


Performing a Linux, Mac OS X, and 
Windows installation 


Obtaining the datasets and example 
code 


Installing a Python 
Distribution 


© 


REMEMBER 


efore you can do too much with Python or use it to solve machine learning 

problems, you need a workable installation. In addition, you need access to 

the datasets and code used for this book. Downloading the sample code 
(found at www .dummies .com/extras/machinelearning) and installing it on your 
system is the best way to get a good learning experience from the book. This 
chapter helps you get your system set up so that you can easily follow the exam- 
ples in the remainder of the book. 


Using the downloadable source code doesn’t prevent you from typing the exam- 
ples on your own, following them using a debugger, expanding them, or working 
with the code in all sorts of ways. The downloadable source code is there to help 
you get a good start with your machine learning and Python learning experience. 
After you see how the code works when it’s correctly typed and configured, you 
can try to create the examples on your own. If you make a mistake, you can com- 
pare what you’ve typed with the downloadable source code and discover precisely 
where the error exists. You can find the downloadable source for this chapter in 
the ML4D; @6; Sample.ipynb and ML4D; @6; Dataset Load.ipynb files. (The 
Introduction tells you where to download the source code for this book.) 
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USING PYTHON 2.7.X FOR THIS BOOK 


There are currently two parallel Python developments. Most books rely on the newest 
version of a language for examples. Actually, however, two newest versions of Python 
exist that you can use as of this writing: 2.7.11 and 3.5.1. Python is unique in that some 
groups use one version and other groups use the other version. Because data scientists 
and others who perform machine learning tasks mainly use the 2.7.x version of Python, 
this book concentrates on that version. (Eventually, all development tasks will move to 
the 3.x version of the product.) Using the 2.7.x version means that you're better able to 
work with other people who perform machine learning tasks when you complete this 
book. If the book used the 3.5.x version instead, you might find it hard to understand 
examples that you see in real-world applications. 


If you truly want to use the 3.5.x version with this book, you can do so, but you need to 
understand that the examples may not always work as written. For example, when using 
the Python 2.7 print() function, you don't absolutely need to include parentheses. The 
Python 3.5 version of the same function raises an error unless you do use the parenthe- 
ses. Even though it seems like a minor difference, it's enough to cause confusion for 
some people, and you need to keep it in mind as you work through the examples. 


Fortunately, you can find a number of online sites that document the version 2.7 
and version 3.5 differences. One of the easiest sites to understand is nbviewer at 
http: //nbviewer . ipython. org/github/rasbt/python_reference/blob/ 
master /tutorials/key_differences_between_python_2_and_3.ipynb. 
Another good place to look is Spartan Ideas athttp: //spartanideas.msu. 
edu/201 4/06/01 /the-key-di f ferences—between-python-2-7-x-and 
python-3-x-with-examples/. These sites will help you if you choose to use 
version 3.5 with this book. However, the book supports only version 2.7, and you 
use version 3.5 at your own risk. Make sure to keep an eye out for changes as 
machine learning evolves on the book's blog athttp: //blog. johnmuel lerbooks. 
com/category/technical /machine—learning-—for—dummies/. The blog will 
help you make any adjustments should they be needed to keep current. 
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It’s entirely possible to obtain a generic copy of Python and add all the required 
machine learning libraries to it. The process can be difficult because you need to 
ensure that you have all the required libraries in the correct versions to guarantee 
success. In addition, you need to perform the configuration required to make sure 
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WARNING 


that the libraries are accessible when you need them. Fortunately, going through 
the required work is not necessary because a number of Python machine learning 
products are available for you to use. These products provide everything needed to 
get started with machine learning projects. 


You can use any of the packages mentioned in the following sections to work with 
the examples in this book. However, the book’s source code and downloadable 
source code rely on Continuum Analytics Anaconda because this particular pack- 
age works on every platform this book supports: Linux, Mac OS X, and Windows. 
The book doesn’t mention a specific package in the chapters that follow, but any 
screenshots reflect how things look when using Anaconda on Windows. You may 
need to tweak the code to use another package, and the screens will look different 
if you use Anaconda on some other platform. 


Windows 10 presents some serious installation issues when working with Python. 
You can read about these issues on my blog at http: //blog. johnmuel lerbooks. 
com/2015/10/30/python-and-windows-1@/. Given that so many readers of my 
other Python books have sent feedback saying that Windows 10 doesn’t provide a 
good environment, it’s hard to recommend Windows 10 as a Python platform for 
this book. If you’re working with Windows 10, simply be aware that your road to a 
Python installation will be a rocky one. 


Getting Continuum Analytics Anaconda 


The basic Anaconda package is a free download that you obtain at https: //store. 
continuum. io/cshop/anaconda/. Simply click Download Anaconda to obtain 
access to the free product. You do need to provide an email address to get a copy 
of Anaconda. After you provide your email address, you go to another page, where 
you can choose your platform and the installer for that platform. Anaconda 
supports the following platforms: 


2» Windows 32-bit and 64-bit (the installer may offer you only the 64-bit or 32-bit 
version, depending on which version of Windows it detects) 


>> Linux 32-bit and 64-bit 
>> Mac OS X 64-bit 


The default download version installs Python 2.7, which is the version used in this 
book (see the “Using Python 2.7.x for this book” sidebar for details). You can also 
choose to install Python 3.5 by clicking one of the links in the Python 3.5 part of 
the page. Both Windows and Mac OS X provide graphical installers. When using 
Linux, you rely on the bash utility. 
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TIP 


You can obtain Anaconda with older versions of Python. If you want to use an 
older version of Python, click the installer archive link near the bottom of the 
page. You should use an older version of Python only when you have a pressing 
need to do so. 


The Miniconda installer can potentially save time by limiting the number of 
features you install. However, trying to figure out precisely which packages you do 
need is an error-prone and time-consuming process. In general, you want to per- 
form a full installation to ensure that you have everything needed for your proj- 
ects. Even a full install doesn’t require much time or effort to download and install 
on most systems. 


The free product is all you need for this book. However, when you look on the site, 
you see that many other add-on products are available. These products can 
help you create robust applications. For example, when you add Accelerate to the 
mix, you obtain the capability to perform multicore and GPU-enabled operations. 
The use of these add-on products is outside the scope of this book, but the Ana- 
conda site provides details on using them. 


Getting Enthought Canopy Express 


Enthought Canopy Express is a free product for producing both technical and 
scientific applications using Python. You can obtain it athttps : //www.enthought. 
com/canopy-express/. Click Download Free on the main page to see a listing of 
the versions that you can download. Only Canopy Express is free; the full Canopy 
product comes at a cost. However, you can use Canopy Express to work with the 
examples in this book. Canopy Express supports the following platforms: 


>> Windows 32-bit and 64-bit 
>> Linux 32-bit and 64-bit 
>> Mac OS X 32-bit and 64-bit 


Choose the platform and version you want to download. When you click Download 
Canopy Express, you see an optional form for providing information about your- 
self. The download starts automatically, even if you don’t provide personal infor- 
mation to the company. 


One of the advantages of Canopy Express is that Enthought is heavily involved in 
providing support for both students and teachers. People also can take classes, 
including online classes, that teach the use of Canopy Express in various ways (see 
https: //training.enthought.com/courses). 
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Also offered is live classroom training designed specifically for the data scientist; 
read about this training at https: //www.enthought.com/services/training/ 
data-science. Taking the data science classes won’t teach you the details of 

TIP working through machine learning problems, but they help you understand how 
to work with big data, which is a part of working through machine learning prob- 
lems. In short, knowing data science gives you a boost in using Python for machine 
learning, but it doesn’t completely remove the learning curve. 


Getting pythonxy 


The pythonxy Integrated Development Environment (IDE) is a community project 
hosted on Google at http: //python-xy. github. io/. It’s a Windows-only prod- 
uct, so you can’t easily use it for cross-platform needs. (In fact, it supports only 
Windows Vista, Windows 7, and Windows 8.) However, it does come with a full set 
of libraries, and you can easily use it for this book if you want. 


Because pythonxy uses the GNU General Public License (GPL) v3 (see http: //www. 
gnu.org/licenses/gp1.html), you have no add-ons, training, or other paid 
features to worry about. No one will come calling at your door hoping to sell you 
something. In addition, you have access to all the source code for pythonxy, so you 
can make modifications if you want. 


Getting WinPython 


The name tells you that WinPython is a Windows-only product that you can find 
at http: //winpython.sourceforge.net/. This product is actually a spin-off of 
pythonxy and isn’t meant to replace it. Quite the contrary: WinPython is simply a 
more flexible way to work with pythonxy. You can read about the motivation for 
creating WinPython at http: //sourceforge.net/p/winpython/wiki/Roadmap/. 


The bottom line for this product is that you gain flexibility at the cost of friendli- 
ness and a little platform integration. However, for developers who need to main- 
tain multiple versions of an IDE, WinPython may make a significant difference. 
When using WinPython with this book, make sure to pay particular attention to 
configuration issues or you’ll find that even the downloadable code has little 
chance of working. 


Installing Python on Linux 


You use the command line to install Anaconda on Linux — there is no graphical 
installation option. Before you can perform the install, you must download a copy 
of the Linux software from the Continuum Analytics site. You can find the required 
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download information in the “Getting Continuum Analytics Anaconda” section, 
earlier in this chapter. The following procedure should work fine on any Linux 
system, whether you use the 32-bit or 64-bit version of Anaconda: 


1. Open a copy of Terminal. 
The Terminal window appears. 
2. Change directories to the downloaded copy of Anaconda on your system. 


The name of this file varies, but normally it appears as Anaconda2-2. 
4.1-Linux-x86.sh for 32-bit systems and Anaconda2-2.4.41-Linux-x86_64. 
sh for 64-bit systems. The version number is embedded as part of the filename. 
In this case, the filename refers to version 2.4.1, which is the version used for 
this book. If you use some other version, you may experience problems with 
the source code and need to make adjustments when working with it. 


3. Type bash Anaconda2-2.4.1-Linux-x86.sh (for the 32-bit version) or 
Anaconda2-2.4.1-Linux-x86_64.sh (for the 64-bit version) and press Enter. 


An installation wizard starts that asks you to accept the licensing terms for 
using Anaconda. 


4. Read the licensing agreement and accept the terms using the method 
required for your version of Linux. 


The wizard asks you to provide an installation location for Anaconda. The book 
assumes that you use the default location of ~/anaconda. If you choose some 
other location, you may have to modify some procedures later in the book to 
work with your setup. 


5. Provide an installation location (if necessary) and press Enter (or click Next). 


The application extraction process begins. After the extraction is complete, you 
see a completion message. 


6. Add the installation path to your PATH statement using the method required 
for your version of Linux. 


You're ready to begin using Anaconda. 


Installing Python on Mac OS X 


The Mac OS X installation comes in only one form: 64-bit. Before you can perform 
the install, you must download a copy of the Mac software from the Continuum 
Analytics site. You can find the required download information in the “Getting 
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Continuum Analytics Anaconda” section, earlier in this chapter. The following 
steps help you install Anaconda 64-bit on a Mac system: 


1. 


Locate the downloaded copy of Anaconda on your system. 


The name of this file varies, but normally it appears as Anaconda2-2.4. 
4-MacOSX-x86_64. pkg. The version number is embedded as part of the 
filename. In this case, the filename refers to version 2.4.1, which is the version 
used for this book. If you use some other version, you may experience 
problems with the source code and need to make adjustments when working 
with it. 


Double-click the installation file. 
An introduction dialog box appears. 
Click Continue. 


The wizard asks whether you want to review the Read Me materials. You can 
read these materials later. For now, you can safely skip the information. 


Click Continue. 


The wizard displays a licensing agreement. Be sure to read through the 
licensing agreement so that you know the terms of usage. 


Click | Agree if you agree to the licensing agreement. 


The wizard asks you to provide a destination for the installation. The destina- 
tion controls whether the installation is for an individual user or a group. 


You may see an error message stating that you can't install Anaconda on the 
system. The error message occurs because of a bug in the installer and has 
nothing to do with your system. To get rid of the error message, choose the 
Install Only for Me option. You can't install Anaconda for a group of users on a 
Mac system. 


Click Continue. 


The installer displays a dialog box containing options for changing the installa- 
tion type. Click Change Install Location if you want to modify where Anaconda 
is installed on your system. (The book assumes that you use the default path of 
~/anaconda.) Click Customize if you want to modify how the installer works. 

For example, you can choose not to add Anaconda to your PATH statement. 
However, the book assumes that you have chosen the default install options, 
and no good reason exists to change them unless you have another copy of 
Python 2.7 installed somewhere else. 
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7. Click Install. 


The installation begins. A progress bar tells you how the installation process is 
progressing. When the installation is complete, you see a completion dialog box. 


8. Click Continue. 
You're ready to begin using Anaconda. 


Continuum also provides a command-line version of the Mac OS X installation. 
This file has a filename of Anaconda2-2.4.1-MacOSX-x86_64.sh, and you use the 
bash utility to install it in the same way that you do on any Linux system. How- 
ever, installing Anaconda from the command line gives you no advantage unless 
you need to perform it as part of an automated setup. Using the GUI version, as 
described in this section, is much easier. 


Installing Python on Windows 
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Anaconda comes with a graphical installation application for Windows, so getting 
a good install means using a wizard, as you would for any other installation. Of 
course, you need a copy of the installation file before you begin, and you can find 
the required download information in the “Getting Continuum Analytics Ana- 
conda” section, earlier in this chapter. The following procedure should work fine 
on any Windows system, whether you use the 32-bit or the 64-bit version of 
Anaconda: 


1. Locate the downloaded copy of Anaconda on your system. 


The name of this file varies, but normally it appears as Anaconda2-2.4. 
41-Windows-x86. exe for 32-bit systems and Anaconda2-2 . 4.1—-Windows-x86_64. 
exe for 64-bit systems. The version number is embedded as part of the filename. In 
this case, the filename refers to version 2.4.1, which is the version used for this book. 
If you use some other version, you may experience problems with the source code 
and need to make adjustments when working with it. 


2. Double-click the installation file. 


(You may see an Open File - Security Warning dialog box that asks whether 
you want to run this file. Click Run if you see this dialog box pop up.) You see 
an Anaconda 2.4.1 Setup dialog box similar to the one shown in Figure 6-1. The 
exact dialog box that you see depends on which version of the Anaconda 
installation program you download. If you have a 64-bit operating system, 
using the 64-bit version of Anaconda is always best so that you obtain the best 
possible performance. This first dialog box tells you when you have the 64-bit 
version of the product. 
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D Anaconda? 2.4.1 (64-bit) Setup fol] eS) 


Welcome to the Anaconda2 2.4.1 
(64-bit) Setup Wizard 


This wizard will guide you through the installation of 
Anaconda? 2.4, 1 (64-bit), 


It is recommended that you dose all other applications 
before starting Setup. This will make it possible to update 
relevant system files without having to reboot your 
computer. 


Click Next to continue. 


FIGURE 6-1: 

The setup 
process begins by CONTINUUM 
telling you ; 
whether you have 
the 64-bit version. 





3. Click Next. 


The wizard displays a licensing agreement. Be sure to read through the 
licensing agreement so that you know the terms of usage. 


4. Click! Agree if you agree to the licensing agreement. 


You're asked what sort of installation type to perform, as shown in Figure 6-2. 
In most cases, you want to install the product just for yourself. The exception is 
if you have multiple people using your system and they all need access to 
Anaconda. 





D Anaconda? 2.4.1 (64-bit) Setup fol] 
Pe Select Installation Type 
6 ANACONDA Please select the type of installation you would like to perform for 


Anaconda2 2.4. 1 (64-bit). 
Install for: 


©) Just Me (recommended) 


All Users (requires admin privileges) 


FIGURE 6-2: 

Tell the wizard 

how to install 
Anaconda on 


your system. 
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FIGURE 6-3: 
Specify an 
installation 
location. 


FIGURE 6-4: 
Configure the 
advanced 
installation 
options. 


5: 


Choose one of the installation types and then click Next. 


The wizard asks where to install Anaconda on disk, as shown in Figure 6-3. The 
book assumes that you use the default location. If you choose some other 
location, you may have to modify some procedures later in the book to work 
with your setup. 








Choose Install Location 
ANACONDA Choose the folder in which to install Anaconda? 2.4.1 (64-bit). 


Setup will install Anaconda2 2.4. 1 (64-bit) in the following folder. To install in a different 
folder, dick Browse and select another folder, Click Next to continue. 


Destination Folder 





C:\Users\John\Anaconda2 


Space required: 353.9MB 
Space available: 321. 1GB 


Choose an installation location (if necessary) and then click Next. 


You see the Advanced Installation Options, shown in Figure 6-4. These options 
are selected by default, and no good reason exists to change them in most 
cases. You might need to change them if Anaconda won't provide your default 
Python 2.7 (or Python 3.5) setup. However, the book assumes that you've set 
up Anaconda using the default options. 


a Advanced 
$ ) ANACONDA Customize how Anaconda integrates with Windows 





Advanced Options 


[V] Add Anaconda to my PATH environment variable 

This ensures that PATH is set correctly when using Python, IPython, 
conda, and any other program in the Anaconda distribution, 

If unchecked, then you must use the Anaconda Command Prompt 
(located in the Start Menu under “Anaconda (64-bit)’). 


[V] Register Anaconda as my default Python 2.7 

This will allow other programs, such as Python Tools for Visual Studio 
PyCharm, Wing IDE, PyDev, and MSI binary packages, to automatically 
detect Anaconda as the primary Python 2.7 on the system. 
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A WORD ABOUT THE SCREENSHOTS 


As you work your way through the book, you use an IDE of your choice to open the 
Python and Python Notebook files containing the book's source code. Every screenshot 
that contains IDE-specific information relies on Anaconda because Anaconda runs on all 
three platforms supported by the book. The use of Anaconda doesn't imply that it's the 
best IDE or that the authors are making any sort of recommendation for it; Anaconda 
simply works well as a demonstration product. 


When you work with Anaconda, the name of the graphical (GUI) environment, Jupyter 
Notebook, is precisely the same across all three platforms, and you won't even see any 
significant difference in the presentation. (jupyter Notebook is a recent evolution of 
IPython, so you may see online resources refer to IPython Notebook.) The differences 
that you do see are minor, and you should ignore them as you work through the book. 
With this in mind, the book does rely heavily on Windows 7 screenshots. When working 
on a Linux, Mac OS X, or other Windows version platform, you should expect to see 
some differences in presentation, but these differences shouldn't reduce your ability to 
work with the examples. 


7. Change the advanced installation options (if necessary) and then click Install. 


You see an Installing dialog box with a progress bar. The installation process 
can take a few minutes, so get yourself a cup of coffee and read the comics for 
a while. When the installation process is over, you see a Next button enabled. 


8. Click Next. 
The wizard tells you that the installation is complete. 
9. Click Finish. 


You're ready to begin using Anaconda. 


Downloading the Datasets 
and Example Code 


This book is about using Python to perform machine learning tasks. Of course, you 
can spend all your time creating the example code from scratch, debugging it, and 
only then discovering how it relates to machine learning, or you can take the easy 
way and download the prewritten code at http://www.dummies.com/extras/ 
machinelearning so that you can get right to work. Likewise, creating datasets 
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large enough for machine learning purposes would take quite a while. Fortunately, 
you can access standardized, precreated datasets quite easily using features pro- 
vided in some of the data science libraries (which also work just fine for machine 
learning). The following sections help you download and use the example code 
and datasets so that you can save time and get right to work with data science- 
specific tasks. 


Using Jupyter Notebook 


To make working with the relatively complex code in this book easier, you use 
Jupyter Notebook. This interface lets you easily create Python notebook files that 
can contain any number of examples, each of which can run individually. The 
program runs in your browser, so which platform you use for development doesn’t 
matter; as long as it has a browser, you should be okay. 


Starting Jupyter Notebook 


Most platforms provide an icon to access Jupyter Notebook. Just open this icon to 
access Jupyter Notebook. For example, on a Windows system, you choose Start 
All Programs Anaconda Jupyter Notebook. Figure 6-5 shows how the interface 
looks when viewed in a Firefox browser. The precise appearance on your system 
depends on the browser you use and the kind of platform you have installed. 


@ Home - Mozilla Firefox 
File Edit View History Bookmarks Tools Help 


J Z Home x 





a 95 localhost:8888/tree c|U + @ 4 At ©- 


Z yupyter 


Files Running Clusters 


Select items to perform actions on them. Upload |New © E 


z 
~ # 


© Anaconda2 


(© AppData 

FIGURE 6-5: TASR 

Jupyter Notebook © Desktop 
provides an easy REA 


method to create 
machine learning 
examples. 


© Downloads 
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If you have a platform that doesn’t offer easy access through an icon, you can use 
these steps to access Jupyter Notebook: 
1. Open a Command Prompt or Terminal Window on your system. 
The window opens so that you can type commands. 
2. Change directories to the \Anaconda2\Scripts directory on your machine. 
Most systems let you use the CD command for this task. 
3. Type ..\python ipython2-script.py notebook and press Enter. 


The Jupyter Notebook page opens in your browser. 


Stopping the Jupyter Notebook server 


No matter how you start Jupyter Notebook (or just Notebook, as it appears in the 
remainder of the book), the system generally opens a command prompt or termi- 
nal window to host Jupyter Notebook. This window contains a server that makes 
the application work. After you close the browser window when a session is com- 
plete, select the server window and press Ctrl+C or Ctrl+Break to stop the server. 


Defining the code repository 


The code you create and use in this book will reside in a repository on your hard 
drive. Think of a repository as a kind of filing cabinet where you put your code. 
Notebook opens a drawer, takes out the folder, and shows the code to you. You can 
modify it, run individual examples within the folder, add new examples, and sim- 
ply interact with your code in a natural manner. The following sections get you 
started with Notebook so that you can see how this whole repository concept 
works. 


Defining the book's folder 


It pays to organize your files so that you can access them easier later. This book 
keeps its files in the ML4D (Machine Learning for Dummies) folder. Use these 
steps within Notebook to create a new folder. 


1. Choose New © Folder. 


Notebook creates a new folder named Untitled Folder, as shown in Figure 6-6. 
The file will appear in alphanumeric order, so you may not initially see it. You 
must scroll down to the correct location. 
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Check the box next to the Untitled Folder entry. 


w 


Click Rename at the top of the page. 


You see a Rename Directory dialog box like the one shown in Figure 6-7. 
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Rename directory 


Enter a new directory name: 


FIGURE 6-7: 
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4. Type ML4D and click OK. 
Notebook changes the name of the folder for you. 
5. Click the new ML4D entry in the list. 


Notebook changes the location to the ML4D folder where you perform tasks 
related to the exercises in this book. 
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FIGURE 6-8: 

A notebook 
contains cells that 
you use to hold 
code. 


Creating a new notebook 


Every new notebook is like a file folder. You can place individual examples within 
the file folder, just as you would sheets of paper into a physical file folder. Each 
example appears in a cell. You can put other sorts of things in the file folder, too, 
but you see how these things work as the book progresses. Use these steps to cre- 
ate a new notebook: 


1. Click New Python 2. 


A new tab opens in the browser with the new notebook, as shown in Figure 6-8. 
Notice that the notebook contains a cell and that Notebook has highlighted the 
cell so that you can begin typing code in it. The title of the notebook is Untitled 
right now. That's not a particularly helpful title, so you need to change it. 


@ Untitled - Mozilla Firefox 
File Edit View History Bookmarks Tools Help 
> ML4D/ x | > Untitled 


A S localhost:8888/noteboc tled ¢ +AA 


Z JUpyter Untitled isutossves 


File Ed View Insert Cell Kernel 





+t x & BR * y AME C Code |] Cell Toolbar: None 

















2. Click Untitled on the page. 
Notebook asks what you want to use as a new name, as shown in Figure 6-9. 
3. Type ML4D; 06; Sample and press Enter. 


The new name tells you that this is a file for Machine Learning For Dummies, 
Chapter 6, Sample.ipynb. Using this naming convention will let you easily dif- 
ferentiate these files from other files in your repository. 
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name for your 
notebook. 








Of course, the Sample notebook doesn’t contain anything just yet. Place the cursor 
in the cell, type print ’Python is really cool!’, and then click the Run button (the 
button with the right- pointing arrow on the toolbar). You see the output shown in 
Figure 6-10. The output is part of the same cell as the code. However, Notebook 
visually separates the output from the code so that you can tell them apart. Note- 
book automatically creates a new cell for you. 


xJ > ML4D; 06; Sample x \ 
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In [1]: | print "Python is really cool!" 


Python is really cool! 
FIGURE 6-10: 
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When you finish working with a notebook, shutting it down is important. To close 
a notebook, choose File Close and Halt. You return to the Home page, where you 
can see the notebook you just created added to the list, as shown in Figure 6-11. 
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FIGURE 6-11: 
Any notebooks 
you create 
appear in the 
repository list. 
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Exporting a notebook 


Creating notebooks and keeping them all to yourself isn’t much fun. At some point, 
you want to share them with other people. To perform this task, you must export 
your notebook from the repository to a file. You can then send the file to someone 
else, who will import it into his or her repository. 


The previous section shows how to create a notebook named ML4D; 06; Sample. 
You can open this notebook by clicking its entry in the repository list. The file 
reopens so that you can see your code again. To export this code, choose File Down- 
load As>IPython Notebook. What you see next depends on your browser, but you 
generally see some sort of dialog box for saving the notebook as a file. Use the same 
method for saving the IPython Notebook file as you use for any other file you save 
using your browser. 


Removing a notebook 


Sometimes notebooks get outdated or you simply don’t need to work with them 
any longer. Rather than allow your repository to get clogged with files you don’t 
need, you can remove these unwanted notebooks from the list. Use these steps to 
remove the file: 
1. Select the box next to the ML4D; 06; Sample.ipynb entry. 
2. Click the trash can icon (Delete) at the top of the page. 

You see a Delete notebook warning message like the one shown in Figure 6-12. 


3. Click Delete. 


The file gets removed from the list. 
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Notebook warns 


removing any 
files from the 
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repository. 


Importing a notebook 


To use the source code from this book, you must import the downloaded files into 
your repository. The source code comes in an archive file that you extract to a 
location on your hard drive. The archive contains a list of . ipynb (IPython Note- 
book) files containing the source code for this book (see the Introduction for 
details on downloading the source code). The following steps tell how to import 
these files into your repository: 

1. Click Upload at the top of the page. 


What you see depends on your browser. In most cases, you see some type of 
File Upload dialog box that provides access to the files on your hard drive. 


2. Navigate to the directory containing the files that you want to import into 
Notebook. 


3. Highlight one or more files to import and click the Open (or other, similar) 
button to begin the upload process. 


You see the file added to an upload list, as shown in Figure 6-13. The file isn’t 
part of the repository yet — you've simply selected it for upload. 


4. Click Upload. 


Notebook places the file in the repository so that you can begin using it. 


Understanding the datasets 
used in this book 


This book uses a number of datasets, all of which appear in the Scikit-learn 
library. These datasets demonstrate various ways in which you can interact with 


PART 2 Preparing Your Learning Tools 


@ ML4D/ - Mozilla Firefox cra | 
File Edit View History Bookmarks Tools Help 


J > mani x \ 





a S é localhost:8888/tree/ML4D# (< + ® AUA 22# @ > = 


Z yupyter 


Files Running Clusters 
Select items to perform actions on them. Upload | New~ © 


~ ®& / ML4D 
FIGURE 6-13: 


The files that you 
want to add to 
the repository } MLAD; 06; Sample ipynb Cancel 

appear as part of 
an upload list. 





data, and you use them in the examples to perform a variety of tasks. The follow- 
ing list provides a quick overview of the function used to import each of the data- 
sets into your Python code: 


22 load_boston( ): Regression analysis with the Boston house-prices dataset 
>> load_iris(): Classification with the iris dataset 

> load_diabetes( ): Regression with the diabetes dataset 

>> load_digits([n_class] ): Classification with the digits dataset 

2» fetch_2@newsgroups(subset='train’) : Data from 20 newsgroups 


>» fetch_olivetti_faces( ): Olivetti faces dataset from AT&T 


The technique for loading each of these datasets is the same across examples. The 
following example shows how to load the Boston house-prices dataset. You can 
find the code in the ML4D; 06; Dataset Load. ipynb notebook. 


from sklearn.datasets import load_boston 
Boston = load_boston() 


print Boston.data.shape 


To see how the code works, click Run Cell. The output from the print call is 
(5@6L, 13L). You can see the output shown in Figure 6-14. 
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from sklearn.datasets import load boston 
Boston = load boston () 
print Boston.data.shape 
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IN THIS CHAPTER 


Performing tasks with numbers and 
logic 


Interacting with strings 
Working with dates 
Packaging tasks using functions 


Making decisions and repeating 
tasks 


Managing data using sets, lists, and 
tuples 


Accessing individual items using 
iterators 


Making data easier to access using 
indexed storage 


Packaging code elements using 
modules 


Coding in Python 
Using Anaconda 


hapter 6 helps you create a Python installation that you can use for a variety 

of purposes, including machine learning and data science. This chapter’s 

focus is on providing a good overview of how Python works as a language 
as well as to solve machine learning problems. If you already know Python, you 
might be able to skip this chapter and move on to the next one. It might be best, 
however, to at least skim this chapter and test some of the examples, just to be 
sure that you have a good installation. 
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REMEMBER 


If you’re new to Python, you can’t use this book to learn Python from scratch — 
for that, you need a book such as Beginning Programming with Python For Dummies, 
by this book’s coauthor John Mueller (Wiley), or you need a tutorial such as the 
one at https://docs.python.org/2/tutorial/. This chapter assumes that 
you’ve worked with other programming languages and have at least an idea of 
how Python works. This limitation aside, you can use this chapter for a refresher 
on how things work in Python, which is all that many people really need. You find 
out the basics of working with Python data types, coding structures, and data 
manipulation strategies. 


This book uses Python 2.7.x. The latest version of Python as of this writing is 
version 2.7.11. If you try to use this book with Python 3.5.1 (or above), you may 
need to modify the examples to compensate for the differences in that version. 
The “Using Python 2.7.x for this book” sidebar in Chapter 6 gives you details about 
Python version differences. Going through the examples in this chapter will help 
you know what to expect with other examples in the book should you decide to use 
version 3.5.1 when solving machine learning problems. 


Working with Numbers and Logic 
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Machine learning involves working with data of various sorts, but much of the 
work involves numbers. In addition, you use logical values to make decisions 
about the data you use. For example, you might need to know whether two values 
are equal or one value is greater than another value. Python supports these num- 
ber and logic value types: 


>> Any whole number is an integer. For example, the value 1 is a whole 
number, so it’s an integer. On the other hand, 1.0 isn’t a whole number; it has 
a decimal part to it, so it’s not an integer. Integers are represented by the int 
data type. On most platforms, you can store numbers between 
-9,223,372,036,854,775,808 and 9,223,372,036,854,775,807 within an int 
(which is the maximum value that fits in a 64-bit variable). 


>> Any number that includes a decimal portion is a floating-point value. For 
example, 1.0 has a decimal part, so it's a floating-point value. Many people get 
confused about whole numbers and floating-point numbers, but the differ- 
ence is easy to remember. If you see a decimal in the number, it’s a floating- 
point value. Python stores floating-point values in the float data type. The 
maximum value that a floating-point variable can contain is 
+1.7976931348623157 x 102° and the minimum value that a floating point 
variable can contain is +2.2250738585072014 x 1008 on most platforms. 
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>> A complex number consists of a real number and an imaginary number 
that are paired together. Just in case you've completely forgotten about 
complex numbers, you can read about them at http: //www.mathsisfun. 
com/numbers/complex—-numbers .htm1. The imaginary part of a complex 
number always appears with a after it. So, if you want to create a complex 
number with 3 as the real part and 4 as the imaginary part, you make an 
assignment like this: myComplex = 3 + 4j. 


>» Logical arguments require Boolean values, which are named after 
George Bool. When using a Boolean value in Python, you rely on the bool 
type. A variable of this type can contain only two values: True or False. You 
can assign a value by using the True or False keywords, or you can create an 
expression that defines a logical idea that equates to true or false. For 
example, you could saymyBool = 1 > 2, which would equate to false 
because 1 is most definitely not greater than 2. 


Now that you have the basics down, you can move on to see the data types in 
action. The next sections provide a quick overview of how you can work with both 
numeric and logical data in Python. 


STYLISTIC CONCERNS WITH PYTHON 


Python developers designed Python to be easy to read and understand. For this reason, 
it comes with certain style conventions. You can see these conventions listed in the 
Pep-8 style guide (https : //www. python. org/dev/peps/pep-@008/). If you want to 
exchange your code with someone else or use it in a public venue, you need to follow 
the conventions relatively closely. However, personal code or example code that you 
create need not follow the conventions precisely. 


You must use the whitespace rules when writing your code because Python uses them to 
determine where code segments begin and end. In addition, Python has some odd rules 
that might seem randomly implemented but make the code easier to work with. For 
example, you can’t mix tabs and spaces in the same document to create whitespace when 
working with Python 3. (Python 2 does allow mixing of tabs and spaces.) The preference is 
to use spaces, and the Python editors on the market tend to use spaces by default. 


Some stylistic concerns are more about preference than about making the code work. 
For example, method names are supposed to use all lowercase letters, and to use under- 
scores to separate words, such as my_method. However, you can use camel case when 
desired, such as myMethod, or even pascal case, such as MyMethod, and the code will 
compile just fine. If you want to make a method private, however, you must use a leading 


(continued) 
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(continued) 


underscore for it, such as __my_method. Adding two underscores, such as __my_method, 
invokes Python’s name mangling to make it harder (albeit not impossible) for someone 
to use the method. The point is that you don't have to be a slave to the stylistic rules as 
long as you're willing to live with the consequences of not following them completely. 


Python does include magic words, such as__init__,__import__, and__file__. You 
don't ever create these names yourself, and you use only the existing magic words as 
defined by Python. A list of these magic words appears athttp: //www.rafekettler . 
com/magicmethods .htm1. The guide tells you the most common uses for the magic 
words as well. 


Performing variable assignments 


When working with applications, you store information in variables. A variable is 


a kind of storage box. Whenever you want to work with the information, you 


access it using the variable. If you have new information that you want to store, 
you put it in a variable. Changing information means accessing the variable first 


and then storing the new value in the variable. Just as you store things in boxes in 


the real world, so you store things in variables (a kind of storage box) when work- 
ing with applications. To store data in a variable, you assign the data to it using 
any of a number of assignment operators (special symbols that tell you how to store 


the data). Table 7-1 shows the assignment operators that Python supports. (The 


example column assumes that MyVar starts with a value of 5 when such a value is 
needed.) 














TABLE 7-1 Python Assignment Operators 

Operator Description Example 

= Assigns the value found in the right operand to the MyVar = 2 results in MyVar 
left operand containing 2 

+= Adds the value found in the right operand to the value MyVar += 2 results in MyVar 
found in the left operand and places the result in the containing 7 
left operand 

-= Subtracts the value found in the right operand from the MyVar — 2 results in MyVar 
value found in the left operand and places the result in containing 3 
the left operand 

*= Multiplies the value found in the right operand by the MyVar *= 2 results in MyVar 
value found in the left operand and places the result in containing 10 


the left operand 





112 PART2 Preparing Your Learning Tools 





























Operator Description Example 
/= Divides the value found in the left operand by the value MyVar /= 2 results in MyVar 
found in the right operand and places the result in the containing 2.5 
left operand 
%= Divides the value found in the left operand by the value MyVar %= 2 results in MyVar 
found in the right operand and places the remainder in containing 1 
the left operand 
A= Determines the exponential value found in the left MyVar **= 2 results in MyVar 
operand when raised to the power of the value foundin containing 25 
the right operand and places the result in the 
left operand 
//= Divides the value found in the left operand by the value MyVar //= 2 results in MyVar 
found in the right operand and places the integer (whole containing 2 
number) result in the left operand 
Doing arithmetic 
Storing information in variables makes it easily accessible. However, to perform 
any useful work with the variable, you usually perform some type of arithmetic 
operation on it. Python supports the common arithmetic operators that you use to 
perform tasks manually. These operators appear in Table 7-2. 
TABLE 7-2 Python Arithmetic Operators 
Operator Description Example 
+ Adds two values together 5+2=7 
- Subtracts the right operand from left operand 5-2=3 
j Multiplies the right operand by the left operand 5*2=10 
$ Divides the left operand by the right operand 5/2=2.5 
% Divides the left operand by the right operand and returns 5%2=1 
the remainder 
EF Calculates the exponential value of the right operand by the 5 **2=25 
left operand 
II Performs integer division, in which the left operand is divided 5//2=2 


by the right operand and only the whole number is returned 
(also called floor division) 
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Sometimes you need to interact with just one variable. Python supports a number 
of unary operators, those that work with just one variable, as shown in Table 7-3. 











TABLE 7-3 Python Unary Operators 
Operator Description Example 
~ Inverts the bits in a number so that all of the 0 bits ~4 results in a value of -5 
become 1 bits and vice versa. 
- Negates the original value so that positive becomes -(-4) results in 4 and -4 
negative and vice versa. results in -4 
+ Is provided purely for the sake of completeness. This +4 results in a value of 4 
operator returns the same value that you 
provide as input. 
Computers can perform other sorts of math tasks because of the way in which the 
processor works. It’s important to remember that computers store data as a series 
of individual bits. Python lets you access these individual bits using bitwise 
operators, as shown in Table 7-4. 
TABLE 7-4 Python Bitwise Operators 
Operator Description Example 
& (And) Determines whether both individual bits within two 0b1100 & 060110 = 0b0100 
operators are true and sets the resulting bit to True 
when they are. 
| (Or) Determines whether either of the individual bits 0b1100 | 0b0110 = 0b1110 


within two operators are true and sets the resulting 
bit to True when they are. 





^ (Exclusive or) 





0b1100 ^ 0b0110 = 061010 





Determines whether just one of the individual bits 
within two operators is true and sets the resulting bit 
to True when one is. When both bits are true or both 
bits are false, the result is False. 





~ (One's 
complement) 


Calculates the one’s complement value of a number. ~0b1100 = -0b1101 


~0b0110 = -0b0111 








<< (Left shift) Shifts the bits in the left operand left by the value of 0600110011 << 2 = 0b11001100 
the right operand. All new bits are set to 0 and all bits 
that flow off the end are lost. 

>> (Right shift) Shifts the bits in the left operand right by the value of 0600110011 >> 2 = 0600001100 


the right operand. All new bits are set to 0 and all bits 
that flow off the end are lost. 
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Comparing data using Boolean 
expressions 


Using arithmetic to modify the content of variables is a kind of data manipulation. 
To determine the effect of data manipulation, a computer must compare the cur- 
rent state of the variable against its original state or the state of a known value. In 
some cases, detecting the status of one input against another is also necessary. All 
these operations check the relationship between two variables, so the resulting 


operators are relational operators, as shown in Table 7-5. 


TABLE 7-5 Python Relational Operators 
Operator Description 


== Determines whether two values are equal. Notice that the relational operator 
uses two equals signs. A mistake that many developers make is to use just 
one equals sign, which results in one value being assigned to another. 


Example 


1 == 2 is False 














l= Determines whether two values are not equal. Some older versions of 1 !=2is True 
Python allowed you to use the <> operator in place of the != operator. Using 
the <> operator results in an error in current versions of Python. 
> Verifies that the left operand value is greater than the right operand value. 1 >2is False 
< Verifies that the left operand value is less than the right operand value. 1<2is True 
>= Verifies that the left operand value is greater than or equal to the right 1 >= 2 is False 


operand value. 





<= Verifies that the left operand value is less than or equal to the right 
operand value. 


1 <= 2 is True 


Sometimes a relational operator can’t tell the whole story of the comparison of 
two values. For example, you might need to check a condition in which two sepa- 
rate comparisons are needed, such as MyAge > 40 and MyHeight < 74. The need 
to add conditions to the comparison requires a logical operator of the sort shown 
in Table 7-6. 


TABLE 7-6 Python Logical Operators 
Operator Description Example 
and Determines whether both operands are true. True and True is True 


True and False is False 
False and True is False 


False and False is False 





(continued) 
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TABLE 7-6 (continued) 





Operator Description Example 
or Determines when one of two operands is true. True or True is True 
True or False is True 
False or True is True 
False or False is False 
not Negates the truth value of a single operand. A true not True is False 
value becomes false and a false value becomes true. 
not False is True 
Computers provide order to comparisons by making some operators more signifi- 
cant than others. The ordering of operators is operator precedence. Table 7-7 shows 
the operator precedence of all the common Python operators, including a few you 
haven’t seen as part of a discussion yet. When making comparisons, always con- 
sider operator precedence; otherwise, the assumptions you make about a com- 
parison outcome will likely be wrong. 
TABLE 7-7 Python Operator Precedence 


Operator Description 


() 


You use parentheses to group expressions and to override the default 
precedence so that you can force an operation of lower precedence (such as 
addition) to take precedence over an operation of higher precedence (such as 
multiplication). 
































+E Exponentiation raises the value of the left operand to the power of the 
right operand. 

~+- Unary operators interact with a single variable or expression. 

*/%// Multiply, divide, modulo, and floor division. 

+- Addition and subtraction. 

>> << Right and left bitwise shift. 

& Bitwise AND. 

^ | Bitwise exclusive OR and standard OR. 

<=<>>= Comparison operators. 

== [= Equality operators. 

= %= /= //= -=+=*=**= Assignment operators. 
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Operator Description 








is Identity operators. 

is not 

in Membership operators. 
notin 

not or and Logical operators. 


Creating and Using Strings 


Of all the data types, strings are the most easily understood by humans and not 
understood at all by computers. A string is simply any grouping of characters that 
you place within double quotation marks. For example, myString = ‘Python is 
a great language.” assigns a string of characters to myString. 


The computer doesn’t see letters at all. Every letter you use is represented by a 
number in memory. For example, the letter A is actually the number 65. To see 
this for yourself, type ord(“A”) at the Python prompt and press Enter. You see 65 
as output. You can convert any single letter to its numeric equivalent using the 
ord() command. 


Because the computer doesn’t really understand strings, but strings are so useful 
in writing applications, you sometimes need to convert a string to a number. You 
can use the int() and float() commands to perform this conversion. For example, 
if you type myInt = int(“123”) and press Enter at the Python prompt, you create 
an int named myInt that contains the value 123. 


You can convert numbers to a string as well by using the str() command. For 
example, if you type myStr = str(1234.56) and press Enter, you create a string 
containing the value "1234.56" and assign it to myStr. The point is that you can 
REMEMBER gO back and forth between strings and numbers with great ease. Later chapters 
demonstrate how these conversions make many seemingly impossible tasks quite 
doable. 





As with numbers, you can use some special operators with strings (and many 
objects). The member operators let you determine when a string contains specific 
content. Table 7-8 shows these operators. 
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TABLE 7-8 Python Membership Operators 





Operator Description Example 

in Determines whether the value in the left operand "Hello" in "Hello 
appears in the sequence found in the right operand Goodbye" is True 

notin Determines whether the value in the left operand is "Hello" not in "Hello 


missing from the sequence found in the right operand Goodbye" is False 


The discussion in this section also indicates that you obviously need to know the 
kind of data that variables contain. You use the identity operators to perform this 
task, as shown in Table 7-9. 





TABLE 7-9 Python Identity Operators 
Operator Description Example 
is Evaluates to true when the type of the value or expressionin type(2) is int is True 
the right operand points to the same type in the 
left operand 
is not Evaluates to true when the type of the value or expressionin type(2) is not int is False 


the right operand points to a different type than the value or 
expression in the left operand 


Interacting with Dates 


Dates and times are items that most people work with quite a bit. Society bases 
almost everything on the date and time that a task needs to be or was completed. 
We make appointments and plan events for specific dates and times. Most of our 
day revolves around the clock. Because of the time-oriented nature of humans, 
it’s a good idea to look at how Python deals with interacting with date and time 
(especially storing these values for later use). As with everything else, computers 
understand only numbers — dates and times don’t really exist. 


To work with dates and times, you must issue a special import datetime com- 
mand. Technically, this act is called importing a module (see the “Storing Code in 
Modules” section, later in the chapter, for details). Don’t worry about how the 

REMEMBER Command works right now — just use it whenever you want to do something with 
a date and time. 


Computers do have clocks inside them, but the clocks are for the humans using 


the computer. Yes, some software also depends on the clock, but again, the 
emphasis is on human needs rather than anything the computer might require. 
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To get the current time, you can simply type datetime.datetime.now() and press 
Enter. You see the full date and time information as found on your computer’s 
clock, such as datetime.datetime(2015, 12, 20, 16, 29, 2, 351000). 


You may have noticed that the date and time are a little hard to read in the existing 
format. Say that you want to get just the current date, and in a readable format. To 
accomplish this task, you access just the date portion of the output and convert it 
into a string. Type str(datetime.datetime.now().date()) and press Enter. Now you 
see the output in a friendlier format, such as '2015-12-20'. 


Interestingly enough, Python also has a time( ) command, which you can use to 
obtain the current time. You can obtain separate values for each of the compo- 
nents that make up date and time using the day, month, year, hour, minute, 
second, and microsecond values. Later chapters help you understand how to use 
these various date and time features to make working with machine learning 
applications easier. 


Creating and Using Functions 


To manage information properly, you need to organize the tools used to perform 
the required tasks. Each line of code that you create performs a specific task, and 
you combine these lines of code to achieve a desired result. Sometimes you need 
to repeat the instructions with different data, and in some cases your code becomes 
so long that keeping track of what each part does is hard. Functions serve as orga- 
nization tools that keep your code neat and tidy. In addition, functions let you 
easily reuse the instructions you’ve created as needed with different data. This 
section of the chapter tells you all about functions. More important, in this section 
you start creating your first serious applications in the same way that professional 
developers do. 


Creating reusable functions 


You go to your closet, take out pants and shirt, remove the labels, and put them on. 
At the end of the day, you take everything off and throw iit in the trash. Hmmm... that 
really isn’t what most people do. Most people take the clothes off, wash them, and 
then put them back into the closet for reuse. Functions are reusable, too. No one 
wants to keep repeating the same task; it becomes monotonous and boring. When 
you create a function, you define a package of code that you can use over and over 
to perform the same task. All you need to do is tell the computer to perform a spe- 
cific task by telling it which function to use. The computer faithfully executes each 
instruction in the function absolutely every time you ask it to do so. 
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When you work with functions, the code that needs services from the function is 
named the caller, and it calls upon the function to perform tasks for it. Much of the 
information you see about functions refers to the caller. The caller must supply 
information to the function, and the function returns information to the caller. 


At one time, computer programs didn’t include the concept of code reusability. As 
a result, developers had to keep reinventing the same code. Before long, someone 
came up with the idea of functions, and the concept has evolved over the years 
until functions have become quite flexible. You can make functions do anything 
you want. Code reusability is a necessary part of applications to 


>> Reduce development time 

>> Reduce programmer error 

>> Increase application reliability 

>? Allow entire groups to benefit from the work of one programmer 
>> Make code easier to understand 


>> Improve application efficiency 


In fact, functions do a whole list of things for applications in the form of reus- 
ability. As you work through the examples in this book, you see how reusability 
makes your life significantly easier. If not for reusability, you’d still be program- 
ming by plugging os and 1s manually into the computer. 


Creating a function doesn’t require much work. To see how functions work, open 
a copy of IPython and type the following code (pressing Enter at the end of each 
line): 


def SayHello(): 
print('Hello There!') 


To end the function, you press Enter a second time after the last line. A function 
begins with the keyword def (for define). You provide a function name, parenthe- 
ses that can contain function arguments (data used in the function), and a colon. 
The editor automatically indents the next line for you. Python relies on whitespace 
to define code blocks (statements that are associated with each other in a 
function). 


You can now use the function. Simply type SayHello() and press Enter. The paren- 
theses after the function name are important because they tell Python to execute 
the function, rather than tell you that you are accessing a function as an object (to 
determine what it is). You see Hello There! as the output from this function. 


PART 2 Preparing Your Learning Tools 


Calling functions 


Functions can accept arguments (additional bits of data) and return values. The 
ability to exchange data makes functions far more useful than they otherwise 
might be. The following sections describe how to call functions in a variety of 
ways to both send and receive data. 


Sending required arguments 


A function can require the caller to provide arguments to it. A required argument 
is a variable that must contain data for the function to work. Open a copy of 
IPython and type the following code: 


def DoSum(Value1, Value2): 
return Valuei + Value2 


You have a new function, DoSum(). This function requires that you provide two 
arguments to use it. At least, that’s what you’ve heard so far. Type DoSum() and 
press Enter. You see an error message telling you that DoSum requires two 
arguments. 


Trying DoSum( ) with just one argument would result in another error message. In 
order to use DoSum(), you must provide two arguments. To see how this works, 
type DoSum(1, 2) and press Enter. You see an output of 3. 


Notice that DoSum( ) always provides an output value of 3 when you supply 1 and 2 
as inputs. The return statement provides the output value. Whenever you see 
return in a function, you know that the function provides an output value. 


Sending arguments by keyword 


As your functions become more complex and the methods to use them do as well, 
you may want to provide a little more control over precisely how you call the func- 
tion and provide arguments to it. Until now, you have used positional arguments, 
which means that you have supplied values in the order in which they appear in 
the argument list for the function definition. However, Python also has a method 
for sending arguments by keyword. In this case, you supply the name of the argu- 
ment followed by an equals sign (=) and the argument value. To see how this 
works, open a copy of IPython and type the following code: 


def DisplaySum(Value1, Value2): 


prine(str ( VallWet)) + + T + esti (ValveZ) + " = " $ 
str((Value1 + Value2))) 
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Notice that the print() function argument includes a list of items to print and 
that those items are separated by plus signs (+). In addition, the arguments are of 
different types, so you must convert them using the str() function. Python lets 
you easily mix and match arguments in this manner. This function also intro- 
duces the concept of automatic line continuation. The print() function actually 
appears on two lines, and Python automatically continues the function from the 
first line to the second. 


Next, it’s time to test DisplaySum( ). Of course, you want to try the function using 
positional arguments first, so type DisplaySum(2, 3) and press Enter. You see the 
expected output of 2 + 3 = 5. Now type DisplaySum(Value2 = 3, Value1 = 2) and 
press Enter. Again, you receive the output 2 + 3 = 5 even though the position of 
the arguments has been reversed. 


Giving function arguments a default value 


Whether you make the call using positional arguments or keyword arguments, the 
functions to this point have required that you supply a value. Sometimes a func- 
tion can use default values when a common value is available. Default values make 
the function easier to use and less likely to cause errors when a developer doesn’t 
provide an input. To create a default value, you simply follow the argument name 
with an equals sign and the default value. To see how this works, open a copy of 
IPython and type the following code: 


def SayHello(Greeting = "No Value Supplied"): 
print(Greeting) 


The SayHello() function provides an automatic value for Greeting when a caller 
doesn’t provide one. When someone tries to call SayHello() without an argu- 
ment, Python doesn’t raise an error; instead, it outputs No Value Supplied. Type 
SayHello() and press Enter to see for yourself. You see the default message. Type 
SayHello(“Howdy!”’) to see a normal response. 


Creating functions with a variable 
number of arguments 


In most cases, you know precisely how many arguments to provide with your 
function. It pays to work toward this goal whenever you can because functions 
with a fixed number of arguments are easier to troubleshoot later. However, 
sometimes you simply can’t determine how many arguments the function will 
receive at the outset. For example, when you create a Python application that 
works at the command line, the user might provide no arguments, the maximum 
number of arguments (assuming there is one), or any number of arguments in 
between. 
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Fortunately, Python provides a technique for sending a variable number of argu- 
ments to a function. You simply create an argument that has an asterisk in front 
of it, such as xVarArgs. The usual technique is to provide a second argument that 
contains the number of arguments passed as an input. To see how this works, 
open a copy of IPython and type the following code: 


def DisplayMulti(ArgCount = @, x*VarArgs): 
print('You passed ' + str(ArgCount) + 
VarArgs) 


arguments.', 


Notice that the print() function displays a string and then the list of arguments. 
Because of the way this function is designed, you can type DisplayMulti() and 
press Enter to see that it’s possible to pass zero arguments. To see multiple argu- 
ments at work, type DisplayMulti(3, Hello’, 1, True) and press Enter. The output 
of ('You passed 3 arguments.', ('Hello', 1, True)) shows that you need 
not pass values of any particular type. 


Working with global and local variables 


As do most modern programming languages, Python provides a means for defin- 
ing global variables (those that are available to everyone within the module) and 
local variables (those that are available only to a specific function or other code 
block). In general, using local variables is more secure and less ambiguous than 
using global variables because anyone can change a global variable, and the loca- 
tion of the global variable is unclear. However, sometimes you need to define 
global variables to create useful code. Here is a code example that demonstrates 
the use of global and local variables: 


mystr = "Hello" 


def printStr(): 
global myStr 
print(myStr ) 
mystr = "There!" 
print(myStr ) 


The first instance of myStr is global because you can access it anywhere within the 
module. When you call printStr(), the code block begins by issuing the global 
statement with myStr to indicate that the code should use the global version of 
myStr with the print() function call that comes next. The code block then defines 
a local copy of myStr with a different value. Now when the code calls print(), it 
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displays the local copy of myStr rather than the global copy. As a result, the output 
from this example is 


Hello 
There! 


Using Conditional and Loop Statements 
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Computer applications aren’t very useful if they perform precisely the same tasks 
the same number of times every time you run them. Yes, they can perform useful 
work, but life seldom offers situations in which conditions remain the same. To 
accommodate changing conditions, applications must make decisions and perform 
tasks a variable number of times. Conditional and loop statements let applications 
perform this task, as described in the sections that follow. 


Making decisions using the if statement 


You use if statements regularly in everyday life. For example, you may say to 
yourself, “If it’s Wednesday, TIl eat tuna salad for lunch.” The Python if state- 
ment is a little less verbose, but it follows precisely the same pattern. To see how 
this works, open a copy of IPython and type the following code: 


def TestValue(Value): 
if Value == 
print('Value equals 5!') 
elif Value == 6: 
print('Value equals 6!') 
else: 
print('Value is something else. ') 
print('It equals ' + str(Value) ) 





Every Python if statement begins, oddly enough, with the word if. When Python 
sees if, it knows that you want it to make a decision. After the word if comes a 
condition. A condition simply states what sort of comparison you want Python to 
make. In this case, you want Python to determine whether Value contains the 
value 5. 


Notice that the condition uses the relational equality operator, ==, and not the 
assignment operator, =. A common mistake that developers make is to use the 
assignment operator rather than the equality operator. 
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The condition always ends with a colon (:). If you don’t provide a colon, Python 
doesn’t know that the condition has ended and will continue to look for additional 
conditions on which to base its decision. After the colon comes any tasks you want 
Python to perform. 


You may need to perform multiple tasks using a single if statement. The elif 
clause lets you add another condition and associated tasks. A clause is an adden- 
dum to a previous condition, which is an if statement in this case. The elif 
clause always provides a condition, just as the if statement does, and it has its 
own associated set of tasks to perform. 


Sometimes you need to do something no matter what the condition might be. In 
this case, you add the else clause. The else clause tells Python to do something 
in particular when the conditions of the if statement aren’t met. 


Notice how indenting is becoming more important as the functions become more 
complex. The function contains an if statement. The if statement contains just 
one print() statement. The else clause contains two print() statements. 


To see this function in action, type TestValue(1) and press Enter. You see the out- 
put from the else clause. Type TestValue(5) and press Enter. The output now 
reflects the if statement output. Type TestValue(6) and press Enter. The output 
now shows the results of the elif clause. The result is that this function is more 
flexible than previous functions in the chapter because it can make decisions. 


Choosing between multiple options 
using nested decisions 


Nesting is the process of placing a subordinate statement within another state- 
ment. You can nest any statement within any other statement in most cases. To 
see how this works, open a copy of IPython and type the following code: 


def SecretNumber(): 
One = int(input("Type a number between 1 and 10: ")) 
Two = int(input("Type a number between 1 and 10: ")) 


if (One >= 1) and (One <= 10): 
if (Two >= 1) and (Two <= 1@): 
print('Your secret number is: ' + str(One * Two)) 
else: 
print("Incorrect second value!") 
else: 
print("Incorrect first value!") 
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In this case, SecretNumber() asks you to provide two inputs. Yes, you can get 
inputs from a user when needed by using the input() function. The int() func- 
tion converts the inputs to a number. 


There are two levels of i f statement this time. The first level checks for the valid- 
ity of the number in One. The second level checks for the validity of the number in 
Two. When both One and Two have values between 1 and 10, SecretNumber() out- 
puts a secret number for the user. 


To see SecretNumber() in action, type SecretNumber() and press Enter. Type 20 
and press Enter when asked for the first input value, and type 10 and press Enter 
when asked for the second. You see an error message telling you that the first 
value is incorrect. Type SecretNumber() and press Enter again. This time, use 
values of 10 and 20. The function tells you that the second input is incorrect. Try 
the same sequence again using input values of 10 and 10. 


Performing repetitive tasks using for 


Sometimes you need to perform a task more than one time. You use the for loop 
statement when you need to perform a task a specific number of times. The for 
loop has a definite beginning and a definite end. The number of times that this 
loop executes depends on the number of elements in the variable you provide. To 
see how the for loop works, open a copy of IPython and type the following code: 


def DisplayMulti(*VarArgs): 
for Arg in VarArgs: 


if Arg.upper() == 'CONT': 

continue 

print('Continue Argument: ' + Arg) 
elif Arg.upper() == 'BREAK': 

break 


print('Break Argument: ' + Arg) 
print('Good Argument: ' + Arg) 


In this case, the for loop attempts to process each element in VarArgs. Notice that 
the loop contains a nested if statement, which tests for two ending conditions. In 
most cases, the code skips the if statement and simply prints the argument. 
However, when the if statement finds the words CONT or BREAK in the input 
values, it performs one of these two tasks: 


>> continue: Forces the loop to continue from the current point of execution 
with the next entry in VarArgs. 


>> break: Stops the loop from executing. 


PART 2 Preparing Your Learning Tools 


TIP 


The keywords can appear capitalized in any way because the upper() function 
converts them to uppercase. The DisplayMulti() function can process any num- 
ber of input strings. To see it in action, type DisplayMulti(’Hello’, ’Goodbye’, 
First’, Last’) and press Enter. You see each of the input strings presented on a 
separate line in the output. Now type DisplayMulti(’Hello’, ’Cont’, ’Goodbye’, 
’Break’, Last’) and press Enter. Notice that the words Cont and Break don’t 
appear in the output because they’ re keywords. In addition, the word Last doesn’t 
appear in the output because the for loop ends before this word is processed. 


Using the while statement 


The while loop statement continues to perform tasks until such time that a con- 
dition is no longer true. As with the for statement, the while statement supports 
both the continue and break keywords for ending the loop prematurely. To see 
how this statement works, open a copy of [Python and type the following code: 


def SecretNumber(): 
GotIt = False 
while GotIt == False: 
One = int(input("Type a number between 1 and 10: ")) 
Two = int(input("Type a number between 1 and 10: ")) 


if (One >= 1) and (One <= 10): 
if (Two >= 1) and (Two <= 10): 
print('Secret number is: ' + str(One +* Two)) 
GotIt = True 
continue 
else: 
print("Incorrect second value!") 
else: 
print("Incorrect first value!") 
print("Try again!") 


This is an expansion of the SecretNumber() function first described in the 
“Choosing between multiple options using nested decisions” section, earlier in 
this chapter. However, in this case, the addition of a while loop statement means 
that the function continues to ask for input until it receives a valid response. 


To see how the while statement works, type SecretNumber() and press Enter. Type 
20 and press Enter for the first prompt. Type 10 and press Enter for the second 
prompt. The example tells you that the first number is wrong and then tells you to 
try again. Try a second time using values of 10 and 20. This time the second number 
is wrong, and you still need to try again. On the third try, use values of 10 and 10. 
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This time you get a secret number. Notice that the use of a continue clause means 
that the application doesn’t tell you to try again. 


Data Using Sets, Lists, and Tuples 


Python provides a host of methods for storing data in memory. Each method has 
advantages and disadvantages. Choosing the most appropriate method for your 
particular need is important. The following sections discuss three common tech- 
niques used for storing data for machine learning needs. 


Creating sets 


Most people have used sets at one time or another in school to create lists of items 
that belong together. These lists then became the topic of manipulation using 
math operations such as intersection, union, difference, and symmetric differ- 
ence. Sets are the best option to choose when you need to perform membership 
testing and remove duplicates from a list. You can’t perform sequence-related 
tasks using sets, such as indexing or slicing. To see how you can work with sets, 
start a copy of IPython and type the following code: 


from sets import Set 
SetA = Set(['Red', 'Blue', '‘Green', 'Black']) 
SetB = Set(['Black', 'Green', 'Yellow', ‘Orange’ ] ) 


Notice that you must import the Set capability into your Python application. The 
module sets contains a Set class that you import into your application in order to 
use the resulting functionality. If you try to use the Set class without first import- 
ing it, Python displays an error message. The book uses a number of imported 
libraries, so knowing how to use the import statement is important. 


Performing operations on sets 


Assuming that you created the sets found in the previous section, you may have 
noticed that each of them has some common elements. To see how these sets are 
similar, create some new sets that rely on math operations for comparison pur- 
poses, as shown in the following code: 


SetX = SetA.union(SetB) 
SetY = SetA. intersection(SetB) 
SetZ = SetA.difference(SetB) 
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To see the results of each math operation, type print ’{0}\n{1}\n{2}’.format(SetX, 
SetY, SetZ) and press Enter. You see one set printed on each line, like this: 


Set(['Blue', 'Yellow', 'Green', '‘Orange', 'Black', 'Red']) 
Set(['Green', 'Black']) 
Set(['Blue', 'Red']) 


The outputs show the results of the math operations: union(), intersection(), 
and di fference( ). (When working with Python 3.5, the output may vary from the 
Python 2.7 output shown. All output in the book is for Python 2.7, so you may see 
differences from time to time when using Python 3.5.) Python’s fancier print for- 
matting can be useful in working with collections such as sets. The format() 
function tells Python which objects to place within each of the placeholders in the 
string. A placeholder is a set of curly brackets ({}) with an optional number in it. 
The escape character (essentially a kind of control or special character), /n, pro- 
vides a newline character between entries. You can read more about fancy format- 
ting at https: //docs.python.org/2/tutorial/inputoutput.html. 


You can also test relationships between the various sets. For example, type SetA. 
issuperset(SetY) and press Enter. The output value of True tells you that SetA is a 
superset of SetY. Likewise, if you type SetA.issubset(SetX) and press Enter, you 
find that SetA is a subset of SetX. 


It’s important to understand that sets are either mutable or immutable. All the 
sets in this example are mutable, which means that you can add or remove 
elements from them. For example, if you type SetA.add(’Purple’) and press Enter, 
SetA receives a new element. If you type SetA.issubset(SetX) and press Enter now, 
you find that SetA is no longer a subset of SetX because SetA has the 'Purple' 
element in it. 


Creating lists 


The Python specification defines a list as a kind of sequence. Sequences simply 
provide some means of allowing multiple data items to exist together in a single 
storage unit, but as separate entities. Think about one of those large mail holders 
you see in apartment buildings. A single mail holder contains a number of small 
mailboxes, each of which can contain mail. Python supports other kinds of 
sequences as well: 


>> Tuples: A tuple is a collection used to create complex list-like sequences. An 
advantage of tuples is that you can nest the content of a tuple. This feature 
lets you create structures that can hold employee records or x-y coordi- 
nate pairs. 
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>» Dictionaries: As with the real dictionaries, you create key/value pairs when 
using the dictionary collection (think of a word and its associated definition). 
A dictionary provides incredibly fast search times and makes ordering data 
significantly easier. 


>> Stacks: Most programming languages support stacks directly. However, 
Python doesn't support the stack, although there's a workaround for that. 
A stack is a last in/first out (LIFO) sequence. Think of a pile of pancakes: You 
can add new pancakes to the top and also take them off of the top. A stack is 
an important collection that you can simulate in Python using a list. 


>> Queues: A queue is a first in/first out (FIFO) collection. You use it to track 
items that need to be processed in some way. Think of a queue as a line at the 
bank. You go into the line, wait your turn, and are eventually called to talk with 
a teller. 


>» Deques: A double-ended queue (deque) is a queue-like structure that lets you 
add or remove items from either end, but not from the middle. You can use a 
deque as a queue or a Stack or any other kind of collection to which you're 
adding and from which you're removing items in an orderly manner (in 
contrast to lists, tuples, and dictionaries, which allow randomized access and 
management). 


Of all the sequences, lists are the easiest to understand and are the most directly 
related to a real-world object. Working with lists helps you become better able to 
work with other kinds of sequences that provide greater functionality and 
improved flexibility. The point is that the data is stored in a list much as you 
would write it on a piece of paper — one item comes after another. The list has a 
beginning, a middle, and an end. As shown in the figure, the items are numbered. 
(Even if you might not normally number them in real life, Python always numbers 
the items for you.) To see how you can work with lists, start a copy of IPython and 
type the following code: 


ListA = [0, 1, 2, 3] 


ListB = [4, 5, 6, 7] 
ListA.extend(ListB) 
ListA 


When you type the last line of code, you see the output of [0, 1, 2, 3, 4, 5, 
6, 7]. The extend() function adds the members of ListB to ListA. Beside 
extending lists, you can also add to them using the append() function. Type 
ListA.append(-5) and press Enter. When you type ListA and press Enter again, 
you see that Python has added -5 to the end of the list. You may find that you 
need to remove items again and you do that using the remove() function. 
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For example, type ListA.remove(-5) and press Enter. When you list ListA again, 
you see that the added entry is gone. 


Lists support concatenation using the plus (+) sign. For example, if you type ListX 
= ListA + ListB and press Enter, you find that the newly created ListX contains 
both ListA and ListB in it with the elements of ListA coming first. 


Creating and using tuples 


A tuple is a collection used to create complex lists, in which you can embed one 
tuple within another. This embedding lets you create hierarchies with tuples. 
A hierarchy could be something as simple as the directory listing of your hard 
drive or an organizational chart for your company. The idea is that you can create 
complex data structures using a tuple. 


Tuples are immutable, which means that you can’t change them. You can create a 
new tuple with the same name and modify it in some way, but you can’t modify 
an existing tuple. Lists are mutable, which means that you can change them. So, a 
tuple can seem at first to be at a disadvantage, but immutability has all sorts of 
advantages, such as being more secure as well as faster. In addition, immutable 
objects are easier to use with multiple processors. To see how you can work with 
tuples, start a copy of IPython and type the following code: 


MyTuple = (1, 2, 3, (4, 5, 6, (7, 8, 9))) 


MyTuple is nested three levels deep. The first level consists of the values 1, 2, 3, 
and a tuple. The second level consists of the values 4, 5, 6, and yet another tuple. 
The third level consists of the values 7, 8, and 9. To see how nesting works, type 
the following code into IPython: 


for Value1 in MyTuple: 
if type(Value1) == int: 
print Valuet 
else: 
for Value2 in Value: 
if type(Value2) == int: 
print "\t", Value2 
else: 
for Value3 in Value2: 
print "\t\t", Value3 
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When you run this code, you find that the values really are at three different levels. 
You can see the indentations showing the level: 


al 
2 
3 
4 
5 
6 
T 
8 
9 


You can perform tasks such as adding new values, but you must do it by adding 
the original entries and the new values to a new tuple. In addition, you can add 
tuples to an existing tuple only. To see how to add tuples , type MyNewTuple = 

TIP MyTuple._add___((10, 11, 12, (13, 14, 15))) and press Enter. MyNewTuple con- 
tains new entries at both the first and second levels, like this: (1, 2, 3, (4, 5, 
6, (7, 8, 9)), 10, 14, 12, (413, 14, 15)). If you were to run the previous 
code against MyNewTuple, you’d see entries at the appropriate levels in the output, 
as shown here: 


al 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 


Defining Useful Iterators 


The chapters that follow use all kinds of techniques to access individual values in 
various types of data structures. For this section, you use two simple lists, defined 
as the following: 
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ListA = ['Orange', 'Yellow', 'Green', 'Brown'] 
ListB la, 2, 8, 4] 


Il 


The simplest method of accessing a particular value is to use an index. For exam- 
ple, if you type ListA[1] and press Enter, you see 'Yellow' as the output. All 
indexes in Python are zero-based, which means that the first entry is 0, not 1. 


Ranges present another simple method of accessing values. For example, if you 
type ListB[1:3] and press Enter, the output is [2, 3]. You could use the range as 
input to a for loop, such as 


for Value in ListB[1:3]: 
print Value 


Instead of the entire list, you see just 2 and 3 as outputs, printed on separate lines. 
The range has two values separated by a colon. However, the values are optional. 
For example, ListB[:3] would output [1, 2, 3]. When you leave out a value, the 
range starts at the beginning or the end of the list, as appropriate. 


Sometimes you need to process two lists in parallel. The simplest method of doing 
this is to use the zip() function. Here’s an example of the zip() function in 
action: 


for Value1, Value2 in zip(ListA, ListB): 
print Valuet, '\t', Value2 


This code processes both ListA and ListB at the same time. The processing ends 
when the for loop reaches the shortest of the two lists. In this case, you see the 
following: 


Orange 1 
Yellow 2 
Green 3 
Brown 4 


This use of iterators is the tip of the iceberg. You see a host of iterator types used 
throughout the book. The idea is to enable you to list just the items you want, 
rather than all the items in a list or other data structure. Some of the iterators used 
in upcoming chapters are a little more complicated than what you see here, but 
this is an important start. 
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Indexing Data Using Dictionaries 


Storing 
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A dictionary is a special kind of sequence that uses a name and value pair. The use 
of a name lets you easily access particular values with something other than a 
numeric index. To create a dictionary, you enclose name and value pairs in curly 
brackets. Create a test dictionary by typing MyDict = {’Orange’:1, ’Blue’:2, 
’Pink’:3} and pressing Enter. 


To access a particular value, you use the name as an index. For example, type 
MyDict[’Pink’] and press Enter to see the output value of 3. The use of dictionar- 
ies as data structures lets you easily access incredibly complex data sets using 
terms that everyone can understand. In many other respects, working with a 
dictionary is the same as working with any other sequence. 


Dictionaries do have some special features. For example, type MyDict.keys() and 
press Enter to see a list of the keys. You can use the values() function to see the 
list of values in the dictionary. 


Code in Modules 


To use a module, you must import it. Python places the module code inline with 
the rest of your application in memory — as if you had created one huge file. 
Neither file is changed on disk; they’re still separate, but the way Python views 
the code is different. 


You have two ways to import modules. Each technique is used in specific 
circumstances: 


>> import: You use the import statement when you want to import an entire 
module. This is the most common method that developers use to import 
modules because it saves time and requires only one line of code. However, 
this approach also uses more memory resources than does the approach of 
selectively importing the attributes you need, which the next paragraph 
describes. 


>> from. ..import: You use the from. . . import statement when you want to 
selectively import individual module attributes. This method saves resources, 
but at the cost of complexity. In addition, if you try to use an attribute that you 
didn’t import, Python registers an error. Yes, the module still contains the 
attribute, but Python can't see it because you didn’t import it. 
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To use the code in a module, Python must be able to locate the module and load it 
into memory. The location information is stored as paths within Python. When- 
ever you request that Python import a module, Python looks at all the files in its 
list of paths to find it. The path information comes from three sources: 


>> Environment variables: Chapter 6 tells you about Python environment 
variables, such as PYTHONPATH, that tell Python where to find modules on disk. 


>> Current directory: Earlier in this chapter, you discovered that you can change 
the current Python directory so that it can locate any modules used by your 
application. 


>> Default directories: Even when you don't define any environment variables 
and the current directory doesn't yield any usable modules, Python can still 
find its own libraries in the set of default directories that are included as part 
of its own path information. 


It’s important to know where Python is looking for modules. Sometimes you 
might encounter an error when Python can’t find a module, even though you 
know the module exists on disk. The following code shows how to determine 
where Python is looking for modules on disk: 


import os 
os.environ['PYTHONPATH' ] 


In some cases, you might find it helpful to change the current directory so Python 
can find specific modules. The following example shows how to perform this task: 


os.chdir("C: \ML4D\Chapter1®" ) 


CHAPTER 7 Coding in Python Using Anaconda 135 


IN THIS CHAPTER 


Performing analytics tasks using SAS 
and SPSS 


Using Weka for knowledge analysis 


Considering support vector 
machines using LIBSVM 


Accessing datasets with huge feature 
lists quickly using Vowpal Wabbit 


Mining data using Knime and 
RapidMiner 


Managing huge datasets using Spark 


Exploring Other 
Machine Learning 
Tools 


his book relies on R and Python to perform machine learning tasks because 

these are currently the two most popular languages that lend themselves to 

the learning process. In addition, these languages help you perform a wide 
variety of generalized tasks. However, they aren’t the only tools at your disposal. 
You might find that you need to perform specific kinds of machine learning tasks 
quickly, which means using a special purpose tool for the task. This chapter pres- 
ents you with some other options — tools that you need to know exist in order to 
make the best use of the machine learning skills that you acquire while using R 
and Python. 
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To present you with as many alternatives as possible, this chapter gives you an 
overview of each tool rather than in-depth coverage. The idea is to help you 
understand what sorts of tasks each tool is good at performing and why you might 
want to add the tool to your toolkit. Some of the tools presented represent the next 
level of complexity beyond R and Python. Other tools exist that are even more 
specialized, and some even require special hardware to use. Machine learning is a 
rapidly growing field that encompasses all sorts of disciplines, so you shouldn’t be 
surprised that so many tools exist. Some tools will work well for you; others won’t. 
You need to consider how a specific tool fits in with your goals for machine 
learning. 


Meeting the Precursors SAS, 
Stata, and SPSS 


As you progress through the book, you discover that statistics plays a huge role in 
machine learning, so discovering the basics of statistics also helps you make 
better use of machine learning. Three commonly used statistics pages are SAS 
(https: //www.sas.com/en_us/home.htm1), Stata (http: //www.stata.com/), and 
SPSS (http: //www-@1.ibm.com/software/analytics/spss/). The three pack- 
ages all perform the same task: enable you to perform statistical analysis. (You 
must also pay for all three packages and none of them are open source.) However, 
the three packages perform the task in different ways and attract attention from 
different groups because of how they work. Table 8-1 compares the three products 
and helps you better understand where they fit within the world of statistics. 











TABLE 8-1 Comparing SAS, Stata, and SPSS 
Characteristic SANS) Stata SPSS 
Ease of Learning Hard Easy Easy (when using GUI) 
Interface Type Programs Commands GUI and commands (the 
command syntax is 
hard to learn) 
Data Complex data processing Simple but strong data Weak data manipulation 
Management on multiple files manipulation of a single file using an Excel-like 
performed using proc sq] using a command interface interface on a single data 
language resulting in SQL file at a time 
queries on SAS data 
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Characteristic SAS Stata SPSS 
Statistical Analysis Support for most major Support for many major Support for most major 
forms of statistical forms of statistical analysis forms of statistical analysis 
analysis including: including: regression, including: regression, 
regression, logistic logistic regression, survival logistic regression, survival 
regression, survival analysis, analysis of analysis, analysis of 
analysis, analysis of variance, factor analysis, variance, factor analysis, 
variance, factor analysis, and some and multivariate analysis 
and multivariate analysis multivariate analysis 
Graphics Considered one of the Considered one of the Moderately powerful 
most powerful but most powerful and graphics using a 
difficult-to-learn graphics easiest-to-learn graphic point-and-click interface, 
environments environments, but lacks but the command syntax 
some graphics editing language is incomplete 
functionality and hard to learn 
Target Audience Power users A combination of power Novice and 
and intermediate users intermediate users 


REDUCING WORKLOAD WITH 
FRAMEWORKS AND TOOLS 


A framework is a kind of environment. Using a framework means using a general- 
purpose software environment to produce specific results by modifying the framework 
in some way. A framework lets you focus more on the problem domain rather than the 
underlying code. Machine learning frameworks address specific problems without 
having to know as much about programming. 


Tools enable you to work more quickly by performing common tasks for you. Some 
tools work inside frameworks to boost the ability of developers to make modifications 
using a common programming paradigm. Here are some machine learning frame- 
works and tools you need to know about: 


© Apache Singa: This is a deep learning framework used extensively for natural lan- 
guage processing and image recognition. The main benefits of this framework are 
that it provides a simple programming model and can work across a cluster of 
machines. The main problem with this product is that it can work slowly and may 
be overkill for some problems. 


© Apache Spark MLlib: This is the Apache Spark's scalable machine learning library 
and is now part of the Spark ecosystem of products; however, it was originally an 
independent contribution from AMPLab, U.C. Berkeley. The main benefit of this 
framework is that it learns extremely fast from data. The main issue is that R and 


(continued) 
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Python provide a larger list of implemented algorithms, yet it’s steadily growing. 
This framework offers a relatively large library of premade solutions. 


Caffe: You use Caffe when expression, speed, and modularity are the prime consid- 
erations for a deep learning framework to use for vision tasks and generally learn- 
ing from images, thanks to its pretrained CaffeNet neural network. The developers 
used C++ for the underlying code, which means that if you don’t understand CH, 
making modifications is nearly impossible, but at least it has Python and MATLAB 
interfaces. However, you can perform deep learning tasks incredibly fast using this 
framework. (You can process 600 million images a day using a single GPU.) 


Google TensorFlow: This framework relies on data flow graphs that define how a 
series of deep learning algorithms process data batches (tensors). The graphs trace 
the flows (movements) of data through the system. You use this framework for 
complex data processing using C+ or Python on CPUs or GPUs. The main disad- 
vantages of this framework are that it requires a lot of processing power and a 
strong knowledge of neural networks. The advantages are that it can handle even 
complex machine learning problems, and modifying the flows to test different 
solutions is relatively easy. 


Oxdata H20: These libraries offer prepackaged routines that you use to perform 
machine learning tasks. The H20 library provides algorithms specifically geared to 
meet business needs. It can directly access any Hadoop Distributed File Store 
(HDFS) using Java, Python, R, and Scala. 


Nervana Neon: One of the directions that machine learning is taking is the use of 
custom hardware and software to perform learning tasks. Neon is an example of a 
framework that can use CPUs, GPUs, or the custom Nervana hardware to perform 
machine learning tasks. This framework relies mostly on Python code with a few 
pieces in C++ to help improve processing speed. The main advantage is that you 
have access to a broader range of processing hardware than any other open source 
offering at the time of this writing. The main disadvantage is that it does have a 
higher-than-usual learning curve. 


Shogun: One of the oldest and most common libraries is Shogun. You can use it 
with all sorts of languages: C+, Java, Python, C#, Ruby, R, Lua, Octave, and Matlab. 


Learning in Academia with Weka 


Weka (http: //www.cs.waikato.ac.nz/ml/weka/ and https://sourceforge. 
net/projects/weka/) is a collection of machine learning algorithms written in 
Java and developed at the University of Waikato, New Zealand. The main purpose 
of Weka is to perform data-mining tasks, and initially, schools used it as a 
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learning tool. Now the tool is included as part of the Pentaho business intelligence 
suite where Weka is used for business intelligence. You can use it for 


>> Association rules 
>> Attribute selection 
>> Clustering 

>> Data preprocessing 
» Data classification 
>»? Data visualization 
>> Regression analysis 


>> Workflow analysis 


The reasons that Weka works especially well in schools is that the Java code runs 
on nearly any platform and you can download Weka free. You can apply Weka 
algorithms directly to a dataset or use Weka within your own Java code, making 
remember the environment extremely flexible. The one downside of Weka is that it tends not 
to work well on really large datasets. To use Weka, you must also install an 
appropriate version of Java on your system (see http: //www.cs.waikato.ac.nz/ 
ml/weka/requirements.html). You can use Weka with any DBMS that Java or a 
third-party Java add-on product supports through Java Database Connectivity 
(JDBC), so you have a wide selection of data sources from which to choose. 


Accessing Complex Algorithms 
Easily Using LIBSVM 


Chapter 17 helps you discover the wonders of support vector machines (SVMs). 
LIBSVM (https: //www.csie.ntu.edu.tw/~cjlin/libsvm/) is a library of SVMs 
that you can use to perform tasks such as the following: 

>> Cross-validation for model selection 

>> Probability estimates 

>» Modeling unbalanced data 


>> Multiclass classification 
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An advantage of LIBSVM is that it relies on extensions to provide various types of 
support. Consequently, you have access to LIBSVM through a huge number of 
languages: Python, R, MATLAB, Perl, Ruby, Weka, Common LISP, CLISP, Haskell, 
OCaml, LabVIEW, PHP, C# .NET, and CUDA. Supporters have also created a wealth 
of tools for the library, including an easy script interface for users who know 
nothing about SVM. (You need Python and gnuplot installed to obtain easy script 
support.) 


If you scroll halfway down the support page, you can find the Graphic Interface 
section, where you can see LIBSVM in action. The example program is a Java 
applet that lets you create data points and then use LIBSVM to interact with them. 
The demonstration lets you try many LIBSVM features visually without actually 
spending time to write code. This applet makes LIBSVM one of the easiest prod- 
ucts to try. 


Running As Fast As Light with 
Vowpal Wabbit 


142 


REMEMBER 


Chapter 12 discusses simple learners, which can be effective, but also quite slow. 
You then work with increasingly complex learners as the book progresses until 
Chapter 18, where you work with ensembles of learners. The point is that machine 
learning can be quite computationally intensive, so finding ways to speed up 
the analysis is important. That’s where Vowpal Wabbit (https: //github.com/ 
JohnLangford/vowpal_wabbit/wiki) comes into play. This open source software 
is all about building fast learners so that you can obtain answers quickly, by 
accessing data directly, example by example (online learning). Initially developed 
at Yahoo Labs, it is currently a Microsoft Research project. 


Vowpal Wabbit relies on sparse gradient descent (initially discussed in Chapter 10, 
but in more depth in Chapters 15 and 18) to perform its work. This is a C++ lan- 
guage product and you must install the Boost library (http://www. boost.org/) to 
use it. However, it does support a Python interface (http: //nbviewer . jupyter. 
org/github/hal3/vowpal_wabbit/blob/master/python/Learning_to_Search. 
ipynb). Among the other things, this product helps you perform tasks such as 
feature hashing, multiclass/multilabel classification, matrix factorization, and 
active learning. Active learning is a kind of interactive learning from data in which 
the algorithm requires certain kinds of examples to approximate the target 
function. 
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Visualizing with Knime and RapidMiner 
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Humans have a terrible time visualizing abstract data, and sometimes machine 
learning output becomes extremely abstract. That’s why many of the examples in 
this book feature graphic output so that you can visualize how the data actually 
appears. Most of the products described in this chapter provide some sort of 
graphic output. However, Knime (https://www.knime.org/) and RapidMiner 
(https: //rapidminer.com/) excel at the task by helping you to easily produce 
high-quality graphics. Their use for various kinds of data mining tasks also dis- 
tinguishes both of these products from other products. 


The pharmaceutical industry relies heavily on Knime to perform both machine 
learning and data-mining tasks by relying on data flows (pipelines) similar in 
function to the ensembles of learners discussed in Chapter 18. The use of a GUI 
makes Knime relatively easy to learn. In fact, Knime relies on one of the most 
popular GUIs available today, Eclipse (https: //eclipse.org/), which is also used 
to support a large number of programming languages, such as Java, C/C++, 
JavaScript, and PHP (among many others available through plug-ins). It also 
integrates well with both Weka and LIBSVM (both of which appear in earlier sec- 
tions of the chapter), so ease of use doesn’t come at the loss of functionality. 


RapidMiner caters more to the needs of business, which uses it for machine learn- 
ing, data mining, text mining, predictive analytics, and business analytics needs. In 
contrast to many products described in this chapter, RapidMiner relies on a client/ 
server model, in which the server appears as a cloud-based Software-as-a-Service 
(SAAS) option. This means that a business can test the environment without mak- 
ing a huge initial investment in either software or hardware. RapidMiner works 
with both R and Python. Companies such as eBay, Intel, PepsiCo, and Kraft Foods 
currently use RapidMiner for various needs. 


A distinguishing characteristic of both these products is that they rely on the 
Extract, Transform, Load (ETL) model. In this model, the process first extracts all 
the data needed from various sources, transforms that data into acommon format, 
and then loads the transformed data into a database for analysis. This process 
appears in Chapter 13 as part of the data preprocessing requirement. However, you 
can find a succinct overview of the process at https: //docs.oracle.com/cd/ 
B19306_01/server .102/b14223/ettover .htm. 
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Dealing with Massive Data by Using Spark 
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Even though most of the examples in this book rely on small datasets (simply to 
avoid hiding techniques by using too much data), the real world relies heavily on 
huge datasets. Imagine trying to wend your way through the enormous data gen- 
erated just by the sales made by Amazon.com every day. The point is that you need 
products that help you manage these huge datasets in a manner that makes them 
easier to work with and faster to process. This is where Spark (http: //spark . 
apache.org/) comes in. It relies on a clustering technique initially mentioned in 
Chapter 10 and explained more fully in Chapter 14. 


The emphasis of Spark is speed. When you visit the site, you’re greeted by statistics, 
such as Spark’s capability to process data a hundred times faster than other prod- 
ucts, such as Hadoop (http: //hadoop . apache. org/) MapReduce (see the tutorial at 
https: //hadoop.apache.org/docs/current/hadoop-mapreduce-client/ 
hadoop-mapreduce-cl ient-core/MapReduceTutorial .htm1) in memory. However, 
Spark also offers flexibility in that it works with Java, Scala, Python, and R, and it 
runs on any platform that supports Apache. You can even run Spark in the cloud if 
you want. 


Spark works with huge datasets, which means that you need to know program- 
ming languages, database management, and other developer techniques to use it. 
This means that the Spark learning curve can be quite high, and you need to 
provide time for developers on your team to learn it. The simple examples at 
http: //spark.apache.org/examples.html give you some ideas of just what is 
involved. Notice that all the examples include some level of coding, so you really 
do need to have programming skills to use this option. 
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Getting Started 
with the 
Math Basics 








IN THIS PART... 


Considering the essentials of machine learning 
functionality 


Ensuring that you get the right results 


Performing tasks with simple learners 


Understanding the math used in machine learning 


IN THIS CHAPTER 


Figuring out why you need a matrix 


Computing with matrix calculus to 
your advantage 


Getting a glance at how probability 
works 


Explaining the Bayesian point of 
view on probability 


Describing observations using 
statistical measures 


Demystifying 
the Math Behind 
Machine Learning 


f you want to implement existing machine learning algorithms from scratch or 

you need to devise new ones, you require a profound knowledge of probability, 

linear algebra, linear programming, and multivariable calculus. You also need 
to know how to translate math into working code, which means having sophisti- 
cated computing skills. This chapter begins by helping you understand the 
mechanics of machine learning math and describes how to translate math basics 
into usable code. 


If you want to apply existing machine learning for practical purposes instead, you 
can leverage existing R and Python software libraries using a basic knowledge of 
math and statistics. In the end, you can’t avoid having some of these skills because 
machine learning has strong roots in both math and statistics, but you don’t need 
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to overdo it. After you get some math basics down, the chapter shows how even 
simple Bayesian principles can help you perform some interesting machine learn- 
ing tasks. 


Even though this introductory book focuses on machine learning experiments 
using R and Python, in the text you still find many references to vectors, matrices, 
variables, probabilities, and their distributions. The book sometimes uses descrip- 
tive statistics as well. Consequently, it helps to know what a mean, a median, and 
a standard deviation are in order to understand what happens under the hood of 
the software you use. This knowledge makes it easier to learn how to use the soft- 
ware better. The last part of the chapter demonstrates how machine learning can 
help you make better predictions, even when you don’t quite have all the informa- 
tion you need. 


Working with Data 
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Machine learning is so appealing because it allows machines to learn from real- 
world examples (such as sales records, signals from sensors, and textual data 
streaming from the Internet) and determine what such data implies. Common 
outputs from a machine learning algorithm are predictions of the future, pre- 
scriptions to act on now, or new knowledge in terms of examples categorized by 
groups. Many useful applications have already become a reality by leveraging 
these results: 


>> Diagnosing hard-to-find diseases 

>> Discovering criminal behavior and detecting criminals in action 

>> Recommending the right product to the right person 

>> Filtering and classifying data from the Internet at an enormous scale 


>> Driving a car autonomously 


The mathematical and statistical basis of machine learning makes outputting 
such useful results possible. Using math and statistics in this way enables the 
algorithms to understand anything with a numerical basis. 


To begin the process, you represent the solution to the problem as a number. For 
example, if you want to diagnose a disease using a machine learning algorithm, 
you can make the response a 1 or a 0 (a binary response) to indicate whether the 
person is ill, with a 1 stating simply that the person is ill. Alternatively, you can 
use a number between 0 and 1 to convey a less definite answer. The value can 
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represent the probability that the person is ill, with o indicating that the person 
isn’t ill and 1 indicating that the person definitely has the disease. 


A machine learning algorithm can provide an answer (predictions) when sup- 
ported by the required information (sample data) and an associated response 
(examples of the predictions that you want to be able to guess). Information can 
include facts, events, observations, counts, measurements, and so on. Any infor- 
mation used as input is a feature or variable (a term taken from statistics). Effec- 
tive features describe the values that relate to the response and help the algorithm 
guess a response using the function it creates given similar information in other 
circumstances. 


There are two types of features: quantitative and qualitative. Quantitative features 
are perfect for machine learning because they define values as numbers (integers, 
floats, counts, rankings, or other measures). Qualitative features are usually labels 
or symbols that convey useful information in a nonnumeric way, a way that you 
can define as more human-like (words, descriptions, or concepts). 


You can find a classic example of qualitative features in the paper “Induction of 
Decision Trees” by John Ross Quinlan (http: //dl.acm.org/citation.cfm?id= 
637969), a computer scientist who contributed to the development of decision 
trees in a fundamental way. Decision trees are one of the most popular machine 
learning algorithms to date. In his paper, Quinlan describes a set of information 
useful for deciding whether to play tennis outside or not, something a machine 
can learn using the proper technique. The set of features described by Quinlan is 
as follows: 


>> Outlook: Sunny, overcast, or rain 
>> Temperature: Cool, mild, hot 
>> Humidity: High or normal 


>> Windy: True or false 


A machine learning algorithm cannot really digest such information. You must 
first transform the information into numbers. Many ways are available to do so, 
but the simplest is one-hot encoding, which turns every feature into a new set of 
binary (values o or 1) features for all its symbolic values. For instance, consider 
the outlook variable, which becomes three new features, as follows: outlook:sunny, 
outlook:overcast, and outlook:rain. Each one will have a numeric value of 1 or 0 
depending on whether the implied condition is present. So when the day is 
sunny, outlook:sunny has a value of 1 and outlook:overcast and outlook:rain have 
a value of 0. 
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In addition to one-hot encoding, you can use a few other techniques to turn 
qualitative features into numbers, especially when a feature is made of words, 
such as a tweet from Twitter, a chunk of text from an online review, or a news 
feed. In the latter part of the book, you have occasion to discuss other ways to 
effectively transform words and concepts into meaningful numbers that are 
understandable by a machine learning algorithm when dealing with textual 
analysis. 


No matter what the information is, for a machine learning algorithm to correctly 
process it, it should always be transformed into a number. 


Creating a matrix 


After you make all the data numeric, the machine learning algorithm requires that 
you turn the individual features into a matrix of features and the individual 
responses into a vector or a matrix (when there are multiple responses). A matrix 
is a collection of numbers, arranged in rows and columns, much like the squares 
in a chessboard. However, unlike a chessboard, which is always square, matrices 
can have a different number of rows and columns. 


By convention, a matrix used for machine learning relies on rows to represent 
examples and columns to represent features. So, as in the example for learning 
the best weather conditions to play tennis, you would construct a matrix that uses 
a new row for each day and columns containing the different values for outlook, 
temperature, humidity, and wind. Typically, you represent a matrix as a series of 
numbers enclosed by square brackets, as shown here: 


11 1 545 1 
X=|46 0 345 2 
7.2 1 754 3 


In this example, the matrix called X contains three rows and four columns, so you 
can say that the matrix has dimensions of 3 by 4 (also written as 3 x 4). To quote 
the number of the rows in a formula, you commonly use the letter n and the letter 
m for the number of columns. Knowing the size of a matrix is fundamental for 
correctly operating on it. 


Operating on a matrix also requires being able to retrieve a number or a portion of 
a matrix for specific calculations. You use indexes, numbers that tell the position 
of an element in a matrix to perform this task. Indexes point out the row and col- 
umn number that correspond to the location of a values of interest. Usually you 
use i for the row index and j for the column index. Both i and j indexes start count- 
ing rows and columns beginning with the number 0 (0-indexed) or 1(1-indexed). 
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R matrices are 1-indexed, whereas matrices in Python are 0-indexed. The use of 
different index starting points can prove confusing, so you need to know how the 
language operates. 


When viewing the example matrix, the element 2,3 is the element located in the 
second row intersecting with the third column; that is, 345 (assuming that the 
matrix is 1-index). Therefore, if you would need to express three different ele- 
ments of matrix X, you can use the following notation: 


X1,14=1.1, X2,3=345, X3,4=3 


Sometimes multiple matrices are stacked in slices of a more complex data struc- 
ture called an array. An array is a collection of numeric data having more than two 
dimensions. As an example, you can have three-dimensional arrays where each 
matrix represents a different timeframe, and the matrices are then stacked 
together as the slices of a cake. A case for such an array happens when you con- 
tinuously record medical data, perhaps from a scanner recording body functions 
such as brain activity. In this case, rows are still examples and columns are 
features — with the third dimension representing time. 


A matrix that has a single feature is a special case called a vector. Vector is a term 
used in different scientific disciplines, from physics, to medical disciplines, to 
mathematics, so some confusion may arise depending on your previous expertise. 
In machine learning, a vector is simply a matrix of dimension n by 1, and that’s all. 


Vectors are lists of consecutive values. You have to represent and treat them as 
single columns when compared to a two-dimensional matrix. When working with 
a vector, you have a single positional index, i, which tells you the location of the 
value you want to access in the element sequence. You mostly use vectors when 
talking about response values (response vector) or when dealing with the internal 
coefficients of some algorithms. In this case, you call them a vector of coefficients. 


44 
y =| 21 | y, =44, y: =21, y; =37 
37 


In machine learning, the matrix of features usually appears as X and the corre- 
sponding vector of responses as y. More generally, matrices usually use a capital 
letter and vectors use a lowercase letter for identification. In addition, you use 
lowercase letters for constants, so you need to exercise care when determining 
whether a letter is a vector or a constant, because the set of possible operations is 
quite different. 


You use matrices in machine learning quite often because they allow you to rapidly 
organize, index, and retrieve large amounts of data in a uniform and meaningful way. 
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For every example i in the X feature matrix, you can then determine the i-th row of 
the matrix expressing its features and the i-th element on the response vector telling 
you the results that a specific set of features implies. This strategy allows the algo- 
rithm to look up data and make product predictions quickly. 


Matrix notation also allows you to perform systematic operations on the entire 
matrix or portions of it quickly. Matrices are also useful for writing and executing 
programs in a speedy way because you can use computer commands to execute 
matrix operations. 


Understanding basic operations 


The basic matrix operations are addition, subtraction, and scalar multiplication. 
They are possible only when you have two matrices of the same size and the result 
is a new matrix of the same dimensions. If you have two matrices of the same 
shape, you just have to apply the operation to each corresponding position in the 
two matrices. Therefore, to perform addition, you start summing the values in the 
first row and first column of the two source matrices and place the resulting value 
in the same position of the resulting matrix. You continue the process for each 
paired element in the two matrices until you complete all the operations. The 
same process holds true for subtraction, as shown in the following example: 


1 1 1 0 a 0 1 

1 0j lo 1ıf |1 = 
In scalar multiplication, you instead take a single numeric value (the scalar) and 
multiply it for each element of the matrix. If your value is fractional, such as 1/2 or 


1⁄4, your multiplication turns into division. In the previous example, you can mul- 
tiply the resulting matrix by —2: 


iape > 


You can also perform scalar addition and subtraction. In this case, you add or sub- 
tract a single value from all the elements of a matrix. 


Performing matrix multiplication 


Using indexes and basic matrix operations, you can express quite a few operations 
in a compact way. The combination of indexes and operations allows you to 


>> Slice a part of a matrix 


>> Mask a part of matrix, reducing it to zero 


PART 3 Getting Started with the Math Basics 


TIP 


>> Center the values of a matrix by removing a value from all elements 


>» Rescale the values of a matrix, changing its range of values 


However, you can achieve the largest number of operations at one time only when 
you multiply a matrix against a vector or against another matrix. You perform 
these tasks often in machine learning, and multiplying a matrix by a vector occurs 
frequently. Many machine learning algorithms rely on finding a vector of coeffi- 
cients that, multiplied by the matrix of features, can result in an approximation of 
the vector of response values. In such models, you have formulations like this: 


y = Xb 


where y is the response vector, X the feature matrix and b a vector of coefficients. 
Often, the algorithm also includes a scalar named a to add to the result. In this 
example, you can imagine it as being zero, so it isn’t present. As a result, y is a 
vector constituted by three elements: 


With this in mind, you can express the multiplication between X and b as 


4 5 


3 
Xb = 
v-fe af 2) 
3 3 


To express multiplication when vectors or matrices are involved, the standard 
notation is to write them side by side. Whether you write them within parentheses 
or express them in letter form doesn’t matter. This is a common way to point out 
matrix multiplication (called implicit because there is no sign marking the opera- 
tion). As an alternative, you sometimes find an explicit dot format for the opera- 
tion, such as in A-B. The use of the asterisk is limited to scalar products such as 
A*2 or A*b, where b is a constant. 


Next, you need to know how X multiplied by b can result in y. As a check of being 
able to perform the multiplication, the matrix and the vector involved in the mul- 
tiplication should have compatible sizes. In fact, the number of columns of the 
matrix should equal the number of rows in the vector. In this case, there is a 
match because X is 3 by 2 and b is 2 by 1. Knowing the shapes of the terms, you can 
figure out in advance the shape of the resulting matrix, which is given by the rows 
of the matrix and the columns of the vector, or 3 by 1. 
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Matrix vector multiplication works as a series of summed vector-vector multipli- 
cations. Multiplication treats each row of the X matrix as a vector and multiplies 
it by the b vector. The result becomes the corresponding row element of the 
resulting vector. For instance, the first row [4,5] is multiplied by [3,-2] resulting 
in a vector [12,-10], whose elements summed result in a value of 2. This first 
summed multiplication corresponds to the first row of the resulting vector and 
then all the other calculations follow: 


sum([4*3, 5*-2]) = 2 
sum([2*3, 4-2] ) -2 
sum([3*3, 3*-2]) = 3 


The resulting vector is [2, —2, 3]. Things get a little bit more tricky when multi- 
plying two matrices, but you can perform the operation as a series of matrix- 
vector multiplications, just as in the previous example, by viewing the second 
matrix as a series of feature vectors. By multiplying the first matrix by the m vec- 
tors, you obtain a single column of the resulting matrix for each multiplication. 


An example can clarify the steps in obtaining a matrix by matrix multiplication. 
The following example multiplies X by B, which is a square matrix 2 x 2: 


4 5 
3 -2 

XB=|2 4 
-2 5 

3 3 


You can divide the operation into two distinct matrices by vector multiplications 
by splitting the matrix B into column vectors. 











Now all you have to do is take the resulting column vectors and use them to rebuild 
the output matrix using the multiplication of the first column vector as the first 
column in the new matrix and so on. 


4 5 5 2 17 
XB=|2 4 < |-=|-2 16 
2 =l 

3 3 3 9 
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In matrix multiplication, because of matrix shapes, order matters. Consequently, 
you cannot invert terms, as you would do in a multiplication of scalar numbers. 
Multiplying 5*2 or 2*5 is the same thing, because of the commutative property of 
scalar multiplication, but Ab is not the same as bA because sometimes the multi- 
plication isn’t possible (because the shapes of the matrices are incompatible) or, 
worse, it produces a different result. When you have a series of matrix multiplica- 
tions, such as ABC, the order of the terms doesn’t matter; whether you go with AB 
first or BC first, you get the same result because, like scalars, matrix multiplica- 
tion is associative. 


Glancing at advanced matrix operations 


You may encounter two important matrix operations in some algorithm formula- 
tions. They are the transpose and inverse of a matrix. Transposition occurs when a 
matrix of shape n x m is transformed into a matrix m x n by exchanging the rows 
with the columns. Most tests indicate this operation using the superscript T, as in 
AT. You see this operation used most often for multiplication, in order to obtain 
the right dimensions. 


You apply matrix inversion to matrices of shape m x m, which are square matrices 
that have the same number of rows and columns. This operation is quite important 
because it allows the immediate resolution of equations involving matrix multipli- 
cation, such as y=bX, where you have to discover the values in the vector b. Because 
most scalar numbers (exceptions include zero) have a number whose multiplica- 
tion results in a value of 1, the idea is to find a matrix inverse whose multiplication 
will result in a special matrix called the identity matrix, whose elements are zero, 
except the diagonal elements (the elements in positions where the index I is equal 
to the index j). Finding the inverse of a scalar is quite easy (the scalar number n has 
an inverse of n“ that is 1/n). It’s a different story for a matrix. Matrix inversion 
involves quite a large number of computations, so special math functions perform 
the calculations in R or Python. The inverse of a matrix A is indicated as A". 


Sometimes, finding the inverse of a matrix is impossible. When a matrix cannot 
be inverted, it is referred to as a singular matrix or a degenerate matrix. Singular 
matrices aren’t the norm; they’re quite rare. 


Using vectorization effectively 


If performing matrix operations, such as matrix by vector multiplication, seems a 
bit hard, consider that your computer does all the hard work. All you have to do is 
determine in a theoretical formulation what happens to the numbers as you put 
them into matrices, vectors, and constants, and then you sum, subtract, divide, or 
multiply them. 
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Understanding what happens in a machine learning algorithm will give you an 

edge in using the algorithm because you’ll understand how it digests and pro- 

cesses data. To get a correct result, you need to feed the right data to the right 
TIP algorithm, according to the manner in which it works with data. 


In Python, the NumPy package offers all the functionality needed to create and 
manipulate matrices. The ndarray objects allow fast creation of an array, such as 
a multidimensional matrix, by starting with data queued into lists. 


The term ndarray means “n-dimensional array,” implying that you can create 
arrays of multiple dimensions, not just row-by-column matrices. Using a simple 
a list, ndarray can quickly create a vector, as shown in the Python example here: 
import numpy as np 
y = np.array([44,21,37] ) 
print (y) 
print (y.shape) 


[44 21 37] 
(3,) 


The method shape can promptly inform you about the shape of a matrix. In this 
case, it reports only three rows and no columns, which means that the object is a 
vector. 


To create matrices made of rows and columns, you can use a list of lists. The con- 
tents of the lists inside the main list are the rows of your matrix. 


K = ha.errayn( ot, a, S45, i],(4.6, @, S45, ZI, 
[7-2, 1, 154, S 
print (X) 


[[ 41.4 4. 545. 1. 
[ 46 @. 345. 2. ] 
I va aa vA 3 


You can also obtain the same result by using a single list, which creates a vector 
that you can reshape into the desired number of rows and columns. Numbers are 
filled into the new matrix row by row, starting from the element (0,0) down to the 
last one. 


Me ho.erne( (2.0, 2, S45, Aoa 2, 
7.2, 1, 754, 3]).reshape(3,4) 
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Operations with addition and subtractions with scalars using NumPy ndarray are 
straightforward. You just sum, subtract, multiply, or divide using the standard 
operators: 


a =np.array([[1, 1],[4, @]]) 
b = np.array([[1, @],[@, 1]]) 
print (a = b) 


a = naana Mo, -ki =) 
print (a * -2) 


To perform multiplication on vectors and matrices, you use the np.dot function 
instead. The input for this function is two arrays of compatible sizes to multiply 
according to the given order. 


X = np.array([[4, 5],[2, 4],[3, 3]]) 
b = np.array([3,-2] ) 
print(np.dot(X, b)) 


[aa 2 


B= np.array([[3, -2] , [-2, 5]]) 
print (np.dot(X, B)) 


Switching to R, you can easily recreate the same outputs found in the previous 
examples. You don’t need any additional libraries when using R because R per- 
forms the tasks using standard functions. R defines vectors as a concatenation of 
numbers, whereas matrices require the matrix function. 


y <- c(44, 21, 37) 


X < matrix(c(1.1, 1, 545, 1, 4.6, @, 345, 2, 7.2, 
1, 754, 3), nrow=3, ncol=4, byrow=TRUE) 
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The matrix function requires a bit of explanation, though, because it can turn a 
vector of numbers into a matrix by having you define the number of rows, col- 
umns, or both (defining just one is enough; defining both provides clarity to your 
code). R, contrary to Python, arranges the values by columns, not by rows. There- 
fore you have to set the byrow parameter to TRUE if you want to use the same 
approach as Python; otherwise, arrange your input vector like this: 


X <- matrix(c(1.4, 4.6, 7.2, 1, Ø, 1, 545, 345, 754, 
41, 2, 3), nrow=3, ncol=4) 


You can determine the dimensions of your vectors using the length() function. 
Use the dim() function instead for matrices, because applying length() to a 
matrix just tells you the number of elements in it. 


After defining your matrices and vectors, you can perform multiplication using 
the %«% operator . Using the standard multiplication operator can sometimes cre- 
ate a result, but the output is scalar, rather than matrix multiplication. 


X <- matrix(c(4, 5, 2, 4, 3, 3), nrow=3, byrow=TRUE) 
b <= c(3, -2) 
print (X%*%b) 


B <- matrix(c(3, -2,-2, 5), nrow=2, byrow=TRUE) 
print(X %*% B) 


If you change the previous example to use standard multiplication, Xxb, you see 
that the output is the scalar multiplication of the first column of the matrix by the 
first element in the vector, and of the second column of the matrix by the second 
element of the vector. 


Exploring the World of Probabilities 
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Probability tells you the likelihood of an event, and you express it as a number. 
The probability of an event is measured in the range from o (no probability that 
an event occurs) to 1 (certainty that an event occurs). Intermediate values, such as 
0.25, 0.5, and 0.75, say that the event will happen with a certain frequency when 
tried enough times. If you multiply the probability by an integer number repre- 
senting the number of trials you’re going to try, you’ll get an estimate of how 
many times an event should happen on average if all the trials are tried. 


For instance, if you have an event occurring with probability p=0.25 and you 
try 100 times, you’re likely to witness that event happen 0.25 * 100 = 25 times. 
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This is, for example, the probability of picking a certain suit when choosing a card 
randomly from a deck of cards. French playing cards make a classic example of 
explaining probabilities. The deck contains 52 cards equally divided into four 
suits: clubs and spades, which are black, and diamonds and hearts, which are red. 
So if you want to determine the probability of picking an ace, you must consider 
that there are four aces of different suits. The answer in terms of probability is 
p=4/52=0.077. 


Probabilities are between 0 and 1; no probability can exceed such boundaries. You 
define probabilities empirically from observations. Simply count the number of 
times a specific event happens with respect to all the events that interest you. For 
example, say that you want to calculate the probability of how many times fraud 
happens when doing banking transactions or how many times people get a certain 
disease in a particular country. After witnessing the event, you can estimate the 
probability associated with it by counting the number of times the event occurs 
and dividing by the total number of events. 


You can count the number of times the fraud or the disease happens using recorded 
data (mostly taken from databases) and then divide that figure by the total num- 
ber of generic events or observations available. Therefore, you divide the number 
of frauds by the number of transactions in a year, or you count the number of 
people who fell ill during the year with respect to the population of a certain area. 
The result is a number ranging from 0 to 1, which you can use as your baseline 
probability for a certain event given certain circumstances. 


Counting all the occurrences of an event is not always possible, so you need to 
know about sampling. By sampling, which is an act based on certain probability 
expectations, you can observe a small part of a larger set of events or objects, yet 
be able to infer correct probabilities for an event, as well as exact measures such 
as quantitative measurements or qualitative classes related to a set of objects. 


For instance, if you want to track the sales of cars in the United States for the last 
month, you don’t need to track every sale in the country. Using a sample compris- 
ing the sales from a few car sellers around the country, you can determine quan- 
titative measures, such as the average price of a car sold, or qualitative measures, 
such as the car model sold most often. 


Operating on probabilities 


Operations on probabilities are indeed a bit different from numeric operations. 
Because they always have to be in the range of 0 to 1, you must rely on particular 
rules in order for the operation to make sense. For example, summations between 
probabilities are possible if the events are mutually exclusive (they can’t happen 
together). Say that you want to know the probability of drawing a spade or a 
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diamond from a deck of cards. You can sum the probability of drawing a spade and 
the probability of drawing a diamond this way: p=0.25+0.25=0.5. 


You use subtraction (difference) to determine the probability of events that are 
different from the probability of an event that you have already computed. For 
instance, to determine the probability of drawing a card that isn’t a diamond from 
the deck, you just subtract from the probability of drawing any kind of card, which 
is p=1, the probability of drawing a diamond, like so: p=1-0.25=0.75. You get the 
complement of a probability when you subtract a probability from 1. 


Multiplication helps you compute the intersection of independent events. Inde- 
pendent events are events that do not influence each other. For instance, if you play 
a game of dice and you throw two die, the probability of getting two sixes is 1/6 
(the probability of getting six from the first dice) multiplied by 1/6 (the probability 
of getting six from the second dice), which is p=1/6 * 1/6=0.028. This means that 
if you throw the dice one hundred times, you can expect two sixes to come up only 
two or three times. 


Using summation, difference, and multiplication, you can get the probability of 
most complex situations dealing with events. For instance, you can now compute 
the probability getting at least a six from two thrown dices, which is a summation 
of mutually exclusive events: 


>> The probability of having a two sixes: p=1/6 * 1/6 


>> The probability of having a six on the first dice and something other than a six 
on the second one: p= 1/6 * (1-1/6) 


>> The probability of having a six on the second dice and something other than a 
six on the first one: p= 1/6 * (1-1/6) 


Your probability of getting at least one six from two thrown dice is p=1/6 * 1/6 + 
2 * 1/6 * (1-1/6)=0.306. 


Conditioning chance by Bayes’ theorem 


Probability makes sense in terms of time and space, but some other conditions 
also influence the probability you measure. The context is important. When you 
estimate the probability of an event, you may (sometimes wrongly) tend to believe 
that you can apply the probability you calculated to each possible situation. The 
term to express this belief is a priori probability, meaning the general probability of 
an event. 
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For example, when you toss a coin, if the coin is fair, the a priori probability of a 
head is 50 percent. No matter how many times you toss the coin, when faced with 
a new toss the probability for heads is still 50 percent. However, there are other 
situations in which, if you change the context, the a priori probability is not valid 
anymore because something subtle happened and changed it. In this case, you can 
express this belief as an a posteriori probability, which is the a priori probability 
after something happened to modify the count. 


For instance, the a priori probability of a person’s being female is roughly about 
50 percent. However, the probability may differ drastically if you consider only 
specific age ranges, because females tend to live longer, and after a certain age 
there are more females than males. As another example related to gender, if you 
examine the presence of women in certain faculties at a university, you notice that 
fewer females are engaged in the scientific faculties than males. Therefore, given 
these two contexts, the a posteriori probability is different from the expected a 
priori one. In terms of gender distribution, nature and culture can both create a 
different a posteriori probability. 


You can view such cases as conditional probability, and express it as p(ylx), which 
is read as the probability of event y happening given that x has happened. Conditional 
probabilities are a very powerful tool for machine learning. In fact, if the a priori 
probability can change so much because of certain circumstances, knowing the 
possible circumstances can boost your chances of correctly predicting an event by 
observing examples — exactly what machine learning is intended to do. For 
example, as previously mentioned, generally the expectation of a random per- 
son’s being a male or a female is 50 percent. But what if you add the evidence that 
the person’s hair is long or short? You can estimate the probability of having long 
hair as being 35 percent of the population; yet, if you observe only the female 
population, the probability rises to 60 percent. If the percentage is so high in the 
female population, contrary to the a priori probability, a machine learning algo- 
rithm can benefit from knowing whether the person’s hair is long or short. 


In fact, the Naive Bayes algorithm can really take advantage of boosting the chance 
of making a correct prediction by knowing the circumstances surrounding the 
prediction, as explained in Chapter 12, which covers the first, simplest learners. 
Everything starts with Reverend Bayes and his revolutionary theorem of probabil- 
ities. In fact, one of the machine learning tribes (see the “Specifying the Role of 
Statistics in Machine Learning” section of Chapter 2) is named after him. Also, 
there are great expectations for the development of advanced algorithms based on 
Bayesian probability; MIT’s Technology Review magazine mentioned Bayesian 
machine learning as an emerging technolgy that will change our world (http: // 
www2.technologyreview. com/news/402435/10-emerging-technologies-—that— 
will-change-your/). Yet, the foundations of the theorem aren’t all that complicated 
(although they may be a bit counterintuitive if you normally consider just prior 
probabilities without considering posterior ones). 
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Reverend Thomas Bayes was a statistician and a philosopher who formulated his 
theorem during the first half of the eighteenth century. The theorem was never 
published while he was alive. Its publication revolutionized the theory of proba- 
bility by introducing the idea of conditional probability just mentioned. 


Thanks to Bayes’ theorem, predicting the probability of a person’s being male or 
female becomes easier if the evidence is that the person has long hair. The for- 
mula used by Thomas Bayes is quite useful: 


PUB) = PCEIB)ePRd3) / PCE) 


Reading the formula using the previous example as input can provide a better 
understanding of an otherwise counterintuitive formula: 


>> P(B|E): The probability of a belief (B) given a set of evidence (E) (posterior 
probability). Read “belief” as an alternative way to express a hypothesis. In this 
case, the hypothesis is that a person is a female and the evidence is long hair. 
Knowing the probability of such a belief given evidence can help to predict the 
person's gender with some confidence. 


>> P(E|B): The probability of having long hair when the person is a female. This 
term refers to the probability of the evidence in the subgroup, which is itself a 
conditional probability. In this case, the figure is 60 percent, which translates 
to a value of 0.6 in the formula (prior probability). 


>> P(B): The general probability of being a female; that is, the a priori probability 
of the belief. In this case, the probability is 50 percent, or a value of 0.5 
(likelihood). 


>> P(E): The general probability of having long hair. Here it is another a priori 
probability, this time related to the observed evidence. In this formula, it is a 
35 percent probability, which is a value of 0.35 (evidence). 


If you solve the previous problem using the Bayes formula and the values you have 
singled out, the result is 06. * 0.5 / 0.35 = 0.857. That is a high percentage of like- 
lihood, which leads you to affirm that given such evidence, the person is probably 
a female. 


Another common example, which can raise some eyebrows and is routinely found 
in textbooks and scientific magazines, is that of the positive medical test. It is 
quite interesting for a better understanding of how prior and posterior probabili- 
ties may indeed change a lot under different circumstances. 


Say that you’re worried that you have a rare disease experienced by 1 percent of 
the population. You take the test and the results are positive. Medical tests are 
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never perfectly accurate, and the laboratory tells you that when you are ill, the test 
is positive in 99 percent of the cases, whereas when you are healthy, the test will 
be negative in 99 percent of the cases. 


Now, using these figures, you immediately believe that you’re certainly ill, given 
the high percentage of positive tests when a person is ill (99 percent). However, 
the reality is quite different. In this case, the figures to plug into the Bayes’ theo- 
rem are as follows: 


>> 0.99 as P(E|B) 
>> 0.01 as P(B) 
>> 0.01 * 0.99 + 0.99 *0.01 = 0.0198 as P(E) 


The calculations are then 0.01*0.99 / 0.0198 = 0.5, which corresponds to just a 
50 percent probability that you’re ill. In the end, your chances of not being ill are 
more than you expected. You may wonder how this is possible. The fact is that the 
number of people seeing a positive response from the test is as follows: 


>» Who is ill and gets the correct answer from the test: This group is the true 
positives, and it amounts to 99 percent of the 1 percent of the population who 
gets the illness. 


>> Who isn't ill and gets the wrong answer from the test: This group is the 
1 percent of the 99 percent of the population who gets a positive response 
even though they aren't ill. Again, this is a multiplication of 99 percent and 
1 percent. This group corresponds to the false positives. 


If you look at the problem using this perspective, it becomes evident why, when 
limiting the context to people who get a positive response to the test, the proba- 
bility of being in the group of the true positives is the same as that of being in the 
false positives. 


Describing the Use of Statistics 


As a concluding topic related to probability, it’s important to skim through some 
basic statistical concepts related to probability and statistics and understand how 
they can better help you describe the information used by machine learning algo- 
rithms. Previous sections discuss probability in ways that come in handy because 
sampling, statistical distributions, and statistical descriptive measures are all, in 
one way or another, based on concepts of probability. 


CHAPTER 9 Demystifying the Math Behind Machine Learning 163 


164 


Here, the matter is not simply about how to describe an event by counting its 
occurrences; it’s about describing an event without counting all the times it occurs 
in a reliable way. For example, if you want an algorithm to learn how to detect a 
disease or criminal intent, you have to face the fact that you can’t create a matrix 
comprising all the occurrences of the disease or the crime, so the information that 
you’ll elaborate will be necessarily partial. Moreover, if you measure something in 
the real world, you often don’t get the exact measurements because of some error 
in the procedure, imprecision in the instrument you use, or simply because of a 
random nuisance disturbing the process of recording the measure. A simple mea- 
sure such as your weight, for example, will differ every time you get on the scale, 
slightly oscillating around what you believe to be your true weight. If you were 
take this measurement on a larger scale, such as by weighing all the people who 
live in your city on one huge scale, you could get a picture of how difficult it is to 
measure accurately (because error occurs) and completely (because it is difficult 
to measure everything). 


Having partial information, especially if what you want to describe is quite com- 
plex and variegated, isn’t a completely negative condition, because you can use 
smaller matrices, thereby implying fewer computations. Sometimes you can’t 
even get a sample of what you want to describe and learn for certain problems, 
because the event is complex and has a great variety of features. As another exam- 
ple, consider learning how to determine sentiment from a text taken from Inter- 
net, such as from Twitter tweets. Apart from retweets, you’re unlikely to see an 
identical tweet (expressing the same sentiment using precisely the same words for 
precisely the same topic) by another person in a lifetime. You may happen to see 
something somehow similar, but never identical. Therefore, it’s impossible to 
know in advance all the possible tweets that associate certain words to sentiments. 
In short, you have to use a sample and derive general rules from a partial set. 


Even given such practical restrictions and the impossibility of getting all the pos- 
sible data, you can still grasp what you want to describe and learn from it. Sam- 
pling is a part of the statistical practice. When using samples, you choose your 
examples according to certain criteria. When done carefully, you have a certain 
probability that your partial view resembles the global view well. 


In statistics, population refers to all the events and objects you want to measure, 
and a sample is a part of it chosen by certain criteria. Using random sampling, 
which is picking the events or objects to represent randomly, helps create a set of 
examples for machine learning to learn as it would learn from all the possible 
examples. The sample works because the value distributions in the sample are 
similar to those in population, and that’s enough. 
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Random sampling isn’t the only possible approach. You can also apply stratified 
sampling, through which you can control some aspects of the random sample in 
order to avoid picking too many or too few events of a certain kind. After all, ran- 
dom is random, and you have no absolute assurance of always replicating the 
exact distribution of the population. 


A distribution is a statistical formulation describing how to observe an event or a 
measure by telling you the probability of witnessing a certain value. Distributions 
are described in mathematical formula (a topic not covered in the book) and can 
be graphically described using charts such as histograms or distribution plots. The 
information you put into your matrix has a distribution, and you may find that 
distributions of different features are related. A distribution naturally implies a 
variation, and, when dealing with numeric values, it is important to figure out a 
center of variation, which is often the statistical mean, calculated by summing all 
your values and dividing the sum by the number of values you considered. 


The mean is a descriptive measure, telling you the value to expect the most, con- 
sidering all the possible cases. The mean is best suited for a symmetrical and bell- 
haped distribution (so that when values are above the mean, the distribution is 
similarly shaped as for the values below). A famous distribution, the normal or 
Gaussian distribution, is shaped just like that, but in the real world, you can also 
find many skewed distributions that have extreme values only on one side of the 
distribution, thereby influencing the mean too much. 


The median is a measure that takes the value in the middle after you order all your 
observations from the smallest to the largest. Being based on the value order, the 
median is insensible to values in the distribution and can represent a fairer 
descriptor than the average in certain cases.The significance of the mean and 
median descriptors is that they describe a value in the distribution around which 
there is a variation, and machine learning algorithms do care about such varia- 
tion. Most people call the variation a variance. Because variance is a squared num- 
ber, there is also a root equivalent, termed the standard deviation. Machine learning 
takes into account the variance in every single variable (univariate distributions) 
and in all features together (multivariate distributions) to determine how such 
variation impacts the response. 


In other words, statistics matter in machine learning because they convey the idea 
that features have a distribution. Distribution implies variation, and variation is 
like a quantification of information — the more variance in your features, the 
more information that can be matched to the response in order to draw a rule 
from certain types of information to certain responses. You can then use statistics 
to assess the quality of your feature matrix and even leverage statistical measures 
to build effective machine learning algorithms, as discussed later in the book, 
where matrix operations, sampling from distributions, statistics, and probability 
all contribute to solutions and have computers effectively learn from data. 
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IN THIS CHAPTER 


Understanding how machine 
learning works under the hood 


Recognizing the different parts of a 
learning process 


Defining the most common error 
functions 


Deciding which error function is the 
best for your problem 


Glancing at the steps in machine 
learning optimization when using 
gradient descent 


Distinguishing between batch, 
mini-batch, and online learning 


Descending the 
Right Curve 


achine learning may appear as a kind of magic trick to any newcomer to 
the discipline — something to expect from any application of advanced 
scientific discovery, as Arthur C. Clarke, the futurist and author of popu- 
lar sci-fi stories (one of which became the landmark movie 2001: A Space Odyssey), 
expressed by his third law, stating, “any sufficiently advanced technology is indistin- 
guishable from magic.” However, machine learning isn’t magic at all. It’s the appli- 
cation of mathematical formulations to how we view the human learning process. 


Expecting that the world itself is a representation of mathematical and statistical 
formulations, machine learning algorithms strive to learn about such formula- 
tions by tracking them back from a limited number of observations. Just as you 
don’t need to see all the trees in the world to learn to recognize one (because 
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humans can understand the distinguishing characteristics of trees), so machine 
learning algorithms can use the computational power of computers and the wide 
availability of data about everything to learn how to solve a large number of 
important and useful problems. 


Although machine learning is inherently complex, humans devised it, and in its 
initial inception, it simply started mimicking the way we learn from the world. We 
can express simple data problems and basic learning algorithms based on how a 
child perceives and understands the world, or solve a challenging learning prob- 
lem by using the analogy of descending from the top of a mountain by taking the 
right slope. This chapter helps you understand machine learning as a technology 
rather than as magic. To that end, the following sections offer some basic theory 
and then delve into some simple problems that demonstrate the theory. 


Interpreting Learning As Optimization 


168 


Learning comes in many different flavors, depending on the algorithm and its 
objectives. You can divide machine learning algorithms into three main groups 
based on their purpose: 


>> Supervised learning 
>> Unsupervised learning 


>> Reinforcement learning 


Supervised learning 


Supervised learning occurs when an algorithm learns from example data and asso- 
ciated target responses that can consist of numeric values or string labels, such as 
classes or tags, in order to later predict the correct response when posed with new 
examples. The supervised approach is indeed similar to human learning under the 
supervision of a teacher. The teacher provides good examples for the student to 
memorize, and the student then derives general rules from these specific exam- 
ples. You need to distinguish between regression problems, whose target is a 
numeric value, and classification problems, whose target is a qualitative variable, 
such as a class or a tag. Referring to the examples used in the book, a regression 
task determines the average prices of houses in the Boston area, and a classifica- 
tion tasks distinguishes between kinds of iris flowers based on their sepal and 
petal measures. 
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Unsupervised learning 


Unsupervised learning occurs when an algorithm learns from plain examples with- 
out any associated response, leaving to the algorithm to determine the data pat- 
terns on its own. This type of algorithm tends to restructure the data into 
something else, such as new features that may represent a class or a new series of 
uncorrelated values. They are quite useful in providing humans with insights into 
the meaning of data and new useful inputs to supervised machine learning algo- 
rithms. As a kind of learning, it resembles the methods humans use to figure out 
that certain objects or events are from the same class, such as by observing the 
degree of similarity between objects. Some recommendation systems that you find 
on the web in the form of marketing automation are based on this type of learn- 
ing. The marketing automation algorithm derives its suggestions from what 
you’ve bought in the past. The recommendations are based on an estimation of 
what group of customers you resemble the most and then inferring your likely 
preferences based on that group. 


Reinforcement learning 


Reinforcement learning occurs when you present the algorithm with examples that 
lack labels, as in unsupervised learning. However, you can accompany an example 
with positive or negative feedback according to the solution the algorithm pro- 
poses. Reinforcement learning is connected to applications for which the algo- 
rithm must make decisions (so the product is prescriptive, not just descriptive, as 
in unsupervised learning), and the decisions bear consequences. In the human 
world, it is just like learning by trial and error. Errors help you learn because they 
have a penalty added (cost, loss of time, regret, pain, and so on), teaching you that 
a certain course of action is less likely to succeed than others. An interesting 
example of reinforcement learning occurs when computers learn to play video 
games by themselves. In this case, an application presents the algorithm with 
examples of specific situations, such as having the gamer stuck in a maze while 
avoiding an enemy. The application lets the algorithm know the outcome of 
actions it takes, and learning occurs while trying to avoid what it discovers to be 
dangerous and to pursue survival. You can have a look at how the company Google 
DeepMind has created a reinforcement learning program that plays old Atari’s 
videogames at https: //www. youtube. com/watch?v=V1eYniJORnk. When watch- 
ing the video, notice how the program is initially clumsy and unskilled but steadily 
improves with training until it becomes a champion. 
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The learning process 


Even though supervised learning is the most popular and frequently used of the 
three types, all machine learning algorithms respond to the same logic. The 
central idea is that you can represent reality using a mathematical function that 
the algorithm doesn’t know in advance but can guess after having seen some 
data. You can express reality and all its challenging complexity in terms of 
unknown mathematical functions that machine learning algorithms find and 
make advantageous. This concept is the core idea for all kinds of machine learn- 
ing algorithms. To create clear examples, this chapter focuses on supervised 
classification as the most emblematic of all the learning types and provides 
explanations of its inner functioning that you can extend later to other types of 
learning approaches. 


The objective of a supervised classifier is to assign a class to an example after hav- 
ing examined some characteristics of the example itself. Such characteristics are 
called features, and they can be both quantitative (numeric values) or qualitative 
(string labels). To assign classes correctly, the classifier must first examine a cer- 
tain number of known examples closely (examples that already have a class 
assigned to them), each one accompanied by the same kinds of features as the 
examples that don’t have classes. The training phase involves observation of 
many examples by the classifier that helps it learn so that it can provide an answer 
in terms of a class when it sees an example without a class later. 


To give an idea of what happens in the training process, imagine a child learning 
to distinguish trees from other objects. Before the child can do so in an indepen- 
dent fashion, a teacher presents the child with a certain number of tree images, 
complete with all the facts that make a tree distinguishable from other objects of 
the world. Such facts could be features such as its material (wood), its parts (trunk, 
branches, leaves or needles, roots), and location (planted into the soil). The child 
produces an idea of what a tree looks like by contrasting the display of tree fea- 
tures with the images of other different objects, such as pieces of furniture that 
are made of wood but do not share other characteristics with a tree. 


A machine learning classifier works the same. It builds its cognitive capabilities by 
creating a mathematical formulation that includes all the given features in a way 
that creates a function that can distinguish one class from another. Pretend that a 
mathematical formulation, also called target function, exists to express the charac- 
teristics of a tree. In such a case, a machine learning classifier can look for its 
representation as a replica or as an approximation (a different function that works 
alike). Being able to express such mathematical formulation is the representation 
capability of the classifier. 
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From a mathematical perspective, you can express the representation process in 
machine learning using the equivalent term mapping. Mapping happens when you 
discover the construction of a function by observing its outputs. A successful 
mapping in machine learning is similar to a child internalizing the idea of an 
object. She understands the abstract rules derived from the facts of the world in 
an effective way so that when she sees a tree, for example, she immediately rec- 
ognizes it. 


Such a representation (abstract rules derived from real-world facts) is possible 
because the learning algorithm has many internal parameters (constituted of vec- 
tors and matrices of values), which equate to the algorithm’s memory for ideas 
that are suitable for its mapping activity that connects features to response 
classes. The dimensions and type of internal parameters delimit the kind of target 
functions that an algorithm can learn. An optimization engine in the algorithm 
changes parameters from their initial values during learning to represent the tar- 
get’s hidden function. 


During optimization, the algorithm searches among the possible variants of its 
parameter combinations in order to find the best one that allows the correct 
mapping between the features and classes during training. This process evaluates 
many potential candidate target functions from among those that the learning 
algorithm can guess. The set of all the potential functions that the learning algo- 
rithm can figure out is called the hypothesis space. You can call the resulting clas- 
sifier with all its set parameters a hypothesis, a way in machine learning to say that 
the algorithm has set parameters to replicate the target function and is now ready 
to work out correct classifications (a fact demonstrated later). 


The hypothesis space must contain all the parameter variants of all the machine 
learning algorithms that you want to try to map to an unknown function when 
solving a classification problem. Different algorithms can have different hypoth- 
esis spaces. What really matters is that the hypothesis space contains the target 
function (or its approximation, which is a different but similar function). 


You can imagine this phase as the time when a child, in an effort to figure out her 
own idea of a tree, experiments with many different creative ideas by assembling 
her own knowledge and experiences (an analogy for the given features). Natu- 
rally, the parents are involved in this phase, and they provide relevant environ- 
mental inputs. In machine learning, someone has to provide the right learning 
algorithms, supply some nonlearnable parameters (called hyper-parameters), 
choose a set of examples to learn from, and select the features that accompany the 
examples. Just as a child can’t always learn to distinguish between right and 
wrong if left alone in the world, so machine learning algorithms need human 
beings to learn successfully. 


CHAPTER 10 Descending the Right Curve 171 


Even after completing the learning process, a machine learning classifier often 
can’t univocally map the examples to the target classification function because 
many false and erroneous mappings are also possible, as shown in Figure 10-1. In 
many cases, the algorithm lacks enough data points to discover the right function. 
Noise mixed with the data can also cause problems, as shown in Figure 10-2. 


Response y 


| Aa —— True function 

A lack of evidence . , 

makes it hard to Mapped function(s) 
map back to the 
target function. 


Response y 


FIGURE 10-2: 
Noise can cause 
mismatches in 
the data points. Feature x 





Noise in real-world data is the norm. Many extraneous factors and errors that 
occur when recording data distort the values of the features. A good machine 
learning algorithm should distinguish the signals that can map back to the target 


remember function from extraneous noise. 
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Exploring Cost Functions 


The driving force behind optimization in machine learning is the response from a 
function internal to the algorithm, called the cost function. You may see other terms 
used in some contexts, such as loss function, objective function, scoring function, or 
error function, but the cost function is an evaluation function that measures how 
well the machine learning algorithm maps the target function that it’s striving 
to guess. In addition, a cost function determines how well a machine learning 
algorithm performs in a supervised prediction or an unsupervised optimization 
problem. 


The evaluation function works by comparing the algorithm predictions against 
the actual outcome recorded from the real world. Comparing a prediction against 
its real value using a cost function determines the algorithm’s error level. Because 
it’s a mathematical formulation, the cost function expresses the error level in a 
numerical form, thereby keeping errors low. The cost function transmits what is 
actually important and meaningful for your purposes to the learning algorithm. 
As a result, you must choose, or accurately define, the cost function based on an 
understanding of the problem you want to solve or the level of achievement you 
want to reach. 


As an example, when considering stock market forecasting, the cost function 
expresses the importance of avoiding incorrect predictions. In this case, you want 
to make money by avoiding big losses. In forecasting sales, the concern is differ- 
ent because you need to reduce the error in common and frequent situations, not 
in the rare and exceptional ones, so you use a different cost function. 


When the problem is to predict who will likely become ill from a certain disease, 
you prize algorithms that can score a high probability of singling out people who 
have the same characteristics and actually did become ill later. Based on the sever- 
ity of the illness, you may also prefer that the algorithm wrongly chooses some 
people who don’t get ill after all rather than miss the people who actually do get ill. 


The cost function is what truly drives the success of a machine learning applica- 
tion. It’s as critical to the learning process as representation (the capability to 
approximate certain mathematical functions) and optimization (how the machine 
learning algorithms set their internal parameters). Most algorithms optimize 
their own cost function, and you have little choice but to apply them as they are. 
Some algorithms allow you to choose among a certain number of possible func- 
tions, providing more flexibility. When an algorithm uses a cost function directly 
in the optimization process, the cost function is used internally. Given that algo- 
rithms are set to work with certain cost functions, the optimization objective may 
differ from your desired objective. In such a case, you measure the results using 
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an external cost function that, for clarity of terminology, you call an error function 
or loss function (if it has to be minimized) or a scoring function (if it has to be 
maximized). 


With respect to your target, a good practice is to define the cost function that 
works the best in solving your problem, and then to figure out which algorithms 
work best in optimizing it to define the hypothesis space you want to test. When 

TIP you work with algorithms that don’t allow the cost function you want, you can 
still indirectly influence their optimization process by fixing their hyper-parameters 
and selecting your input features with respect to your cost function. Finally, when 
you’ve gathered all the algorithm results, you evaluate them by using your chosen 
cost function and then decide on the final hypothesis with the best result from 
your chosen error function. 


When an algorithm learns from data, the cost function guides the optimization 
process by pointing out the changes in the internal parameters that are the most 
beneficial for making better predictions. The optimization continues as the cost 
function response improves iteration by iteration. When the response stalls or 
worsens, it’s time to stop tweaking the algorithm’s parameters because the algo- 
rithm isn’t likely to achieve better prediction results from there on. When the 
algorithm works on new data and makes predictions, the cost function helps you 
evaluate whether it’s working properly and is indeed effective. 


Deciding on the cost function is an underrated activity in machine learning. It’s a 
fundamental task because it determines how the algorithm behaves after learning 
and how it handles the problem you want to solve. Never rely on default options, 
rememeer but always ask yourself what you want to achieve using machine learning and 
check what cost function can best represent the achievement. 





In Part 4, you find out about some machine learning algorithms, and in Part 5, you 
see how to apply theory to real problems, introducing classification problems for 
scoring text and sentiments. If you need to pick a cost function, machine learning 

TIP explanations and examples introduce a range of error functions for regression and 
classification, comprising root mean squared errors, log loss, accuracy, precision, 
recall, and area under the curve (AUC). (Don’t worry if these terms aren’t quite 
clear right now; they’re explained in detail in Parts 4 and 5.) 


Descending the Error Curve 


The gradient descent algorithm offers a perfect example of how machine learning 
works, and it sums up the concepts expressed so far in the book because you can 
provide it with an intuitive image, not just a mathematical formulation. Moreover, 
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FIGURE 10-3: 

A plotting of 
parameter data 
against the 
output of the cost 
function. 


though it is just one of many possible methods, gradient descent is a widely used 
approach that’s applied to a series of machine learning algorithms presented in 
the book, such as linear models, neural networks, and gradient boosting machines. 


Gradient descent works out a solution by starting from a random solution when 
given a set of parameters (a data matrix made of features and a response). It then 
proceeds in various iterations using the feedback from the cost function, thus 
changing its parameters with values that gradually improve the initial random 
solution and lower the error. Even though the optimization may take a large num- 
ber of iterations before reaching a good mapping, it relies on changes that improve 
the response cost function most (lower error) during each iteration. Figure 10-3 
shows an example of a complex optimization process with many local minima 
(the minimum points on the curve marked with letters) where the process can get 
stuck (it no longer continues after the deep minimum marked with an asterisk) 
and cannot continue its descent. 
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You can visualize the optimization process as a walk in high mountains, with the 
parameters being the different paths to descend to the valley. A gradient descent 
optimization occurs at each step. At each iteration, the algorithm chooses the path 
that reduces error the most, regardless of the direction taken. The idea is that if 
steps aren’t too large (causing the alogorithm to jump over the target), always 
following the most downward direction will result in finding the lowest place. 
Unfortunately, this result doesn’t always occur because the algorithm can arrive at 
intermediate valleys, creating the illusion that it has reached the target. However, 
in most cases, gradient descent leads the machine learning algorithm to discover 
the right hypothesis for successfully mapping the problem. Figure 10-4 shows 
how a different starting point can make the difference. Starting point x1 ends 
toward a local minimum, whereas points x2 and x3 reach the global minimum. 
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FIGURE 10-4: 
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In an optimization process, you distinguish between different optimization out- 
comes. You can have a global minimum that’s truly the minimum error from the 
cost function, and you can have many local minima — solutions that seem to pro- 

TIP duce the minimum error but actually don’t (the intermediate valleys where the 
algorithm gets stuck). As a remedy, given the optimization process’s random ini- 
tialization, running the optimization many times is good practice. This means 
trying different sequences of descending paths and not getting stuck in the same 
local minimum. 


MARKETING IN THE MOMENT 
WITH ROCKET FUEL 


Online machine learning is more common than you may think. The constant flow of 
information available on the Internet increases as the world becomes more digitized. 
One of the more interesting applications that feature online machine learning is Rocket 
Fuel (http: //rocket fuel .com/), which provides a useful and unique programmatic 
trading platform. 


Programmatic trading involves buying large amounts of goods or services automatically 
based on a combination of machine-based transactions, algorithms, and data. When 
used for advertising, the sector in which Rocket Fuel operates, the object of program- 
matic trading is to align the selling side, represented by online publishers (websites on 
which the advertising is placed), to the buying side, represented by advertisers and 
advertising agencies. This approach helps advertising to reach people who are inter- 
ested in it (http: //rocket fuel .com/programmatic—marketing-platform/). 
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Rocket Fuel relies on Real Time Bidding (RTB) platforms (see http: //acuityads .com/ 
real—time-bidding/ and http: //digiday.com/plat forms/what-is-real- 
time—bidding/) as a smart approach that uses machine learning to connect the dots. 
Machine learning determines how best to match the audience of a website to a specific 
piece of advertising. 


The machine learning part of this approach relies on linear and logistic regressions, as 
well as neural networks trained with online learning (because information flow on the 
web is continuous, and the advertising scenery is changeable). The algorithms deter- 
mine whether a person will accept a particular piece of advertising based on the huge 
amount of data that the person produces online, such as interests hinted at by website 
behaviors and social networks (see http: //www. pnas .org/content/110/15/5802. 
ful1), site-provided social and demographic information, and e-commerce website 
preferences and intentions. For instance, by using machine learning techniques, Rocket 
Fuel can determine the right moment to offer a person information on a product or a 
service (nttp: //info. rocket fuel .com/marketing—in-the—moment), thus optimiz- 
ing communication between companies and consumers without wasting efforts and 
attention. 


Updating by Mini-Batch and Online 


Machine learning boils down to an optimization problem in which you look for a 
global minimum given a certain cost function. Consequently, working out an opti- 
mization using all the data available is clearly an advantage, because it allows 
checking, iteration by iteration, to determine the amount of minimization with 
respect to all the data. That’s the reason that most machine learning algorithms 
prefer to use all data available, and they want it accessible inside the computer 
memory. 


Learning techniques based on statistical algorithms use calculus and matrix alge- 
bra, and they need all data in memory. Simpler algorithms, such as those based on 
a step-by-step search of the next best solution by proceeding iteration by itera- 
tion through partial solution (such as the gradient descent discussed in the previ- 
ous section), can gain an advantage when developing a hypothesis based on all 
data because they can catch weaker signals on the spot and avoid being fooled by 
noise in data. 


When operating with data within the limits of the computer’s memory (assuming 
about 4GB or 8GB ), you’re working in core memory. You can solve most machine 
learning problems using this approach. Algorithms that work with core memory 
are called batch algorithms because, as in a factory where machines process single 
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batches of materials, such algorithms learn to handle and predict a single data 
batch at a time, represented by a data matrix. 


However, sometimes data can’t fit into core memory because it’s too big. Data 
derived from the web is a typical example of information that can’t fit easily into 
memory. In addition, data generated from sensors, tracking devices, satellites, 
and video monitoring are often problematic because of their dimensions when 
compared to computer RAM; however, they can be stored easily onto a hard disk, 
given the availability of cheap and large storage devices that easily hold terabytes 
of data. 


A few strategies can save the day when data is too big to fit into the standard 
memory of a single computer. A first solution you can try is to subsample. Data is 
reshaped by a selection of cases (and sometimes even features) based on statistical 
sampling into a more manageable, yet reduced, data matrix. Clearly, reducing the 
amount of data can’t always provide exactly the same results as when globally ana- 
lyzing it. Working on less than the available data can even produce less powerful 
models. Yet, if subsampling is executed properly, the approach can generate almost 
equivalent and still reliable results. A successful subsampling must correctly use 
statistical sampling, by employing random or stratified sample drawings. 


In random sampling, you create a sample by randomly choosing the examples that 
appear as part of the sample. The larger the sample, the more likely the sample 
will resemble the original structure and variety of data, but even with few drawn 
examples, the results are often acceptable, both in terms of representation of the 
original data and for machine learning purposes. 


In stratified sampling, you control the final distribution of the target variable or 
of certain features in data that you deem critical for successfully replicating the 
characteristics of your complete data. A classic example is to draw a sample ina 
classroom made up of different proportions of males and females in order to 
guess the average height. If females are, on average, shorter than and in smaller 
proportion to males, you want to draw a sample that replicates the same propor- 
tion in order to obtain a reliable estimate of the average height. If you sample 
only males by mistake, you’ll overestimate the average height. Using prior insight 
with sampling (such as knowing that gender can matter in height guessing) helps 
a lot in obtaining samples that are suitable for machine learning, as explained in 
Chapter 11. 


After you choose a sampling strategy, you have to draw a subsample of enough 
examples, given your memory limitations, to represent the variety of data. Data 
with high dimensionality, characterized by many cases and many features, is 
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more difficult to subsample because it needs a much larger sample, which may not 
even fit into your core memory. 


Beyond subsampling, a second possible solution to fitting data in memory is to 
leverage network parallelism, which splits data into multiple computers that are 
connected in a network. Each computer handles part of the data for optimization. 
After each computer has done its own computations and all the parallel optimiza- 
tion jobs have been reduced into a single elaboration, a solution is achieved. 


To understand how this solution works, compare the process of building a car 
piece by piece using a single worker to having many workers working separately 
on car part aggregates — leaving a single worker to perform the final assembly. 
Apart from having a faster assembly execution, you don’t have to keep all the 
parts in the factory at the same time. Similarly, you don’t have to keep all the data 
parts in a single computer, but can take advantage of their being processed sepa- 
rately in different computers, thus overcoming core memory limitations. 


This approach is the basis of the map-reduce technology and cluster-computer 
frameworks, Apache Hadoop and Apache Spark, which are focused on mapping a 
problem onto multiple machines and finally reducing their output into the desired 
solution. Unfortunately, you can’t easily split all machine learning algorithms into 
separable processes, and this problem limits the usability of such an approach. 
More important, you encounter significant cost and time overhead in setup and 
maintenance when you keep a network of computers ready for such data process- 
ing, thereby limiting the applicability of this approach to only large organizations. 


A third solution is to rely on out-of-core algorithms, which work by keeping data 
on the storage device and feeding it in chunks into computer memory for process- 
ing. The feeding process is called streaming. Because the chunks are smaller than 
core memory, the algorithm can handle them properly and use them for updating 
the machine learning algorithm optimization. After the update, the system dis- 
cards them in favor of new chunks, which the algorithm uses for learning. This 
process goes on repetitively until there are no more chunks. Chunks can be small 
(depending on core memory), and the process is called mini-batch learning, or they 
can even be constituted by just a single example, called online learning. 


The previously described gradient descent, as with other iterative algorithms, can 
work fine with such an approach; however, reaching an optimization takes longer 
because the gradient’s path is more erratic and nonlinear with respect to a batch 
approach. The algorithm can reach a solution using fewer computations with 
respect to its in-memory version. 
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TIP 


When working with repeated updates of its parameters based on mini-batches 
and single examples, the gradient descent takes the name stochastic gradient 
descent. It will reach a proper optimization solution given two prerequisites: 


>> The examples streamed are randomly extracted (hence the stochastic, 
recalling the idea of a random extraction from a distribution of examples). 


>> A proper learning rate is defined as fixed or flexible according to the number 
of observations or other criteria. 


Disregarding the first prerequisite implies that you must also consider the order- 
ing of the examples — a sometimes undesirable effect. The learning rate makes 
the learning more or less open to updates, rendering the learning itself more or 
less flexible in dealing with the characteristics of examples, which are seen later 
in the stream. 


The learning parameter can make a difference in the quality of the optimization 
because a high learning rate, though faster in the optimization, can constrain the 
parameters to the effects of noisy or erroneous examples seen at the beginning of 
the stream. A high learning rate also renders the algorithm insensible to the latter 
streamed observations, which can prove to be a problem when the algorithm is 
learning from sources that are naturally evolving and mutable, such as data from 
the digital advertising sector, where new advertising campaigns often start 
mutating the level of attention and response of targeted individuals. 
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IN THIS CHAPTER 


Explaining how correct sampling is 
critical in machine learning 


Highlighting errors dictated by bias 
and variance 


Proposing different approaches to 
validation and testing 


Warning against biased samples, 
overfitting, underfitting, and snooping 


Validating Machine 
Learning 


aving examples (in the form of datasets) and a machine learning algorithm 

at hand doesn’t assure that solving a learning problem is possible or that 

the results will provide the desired solution. For example, if you want your 
computer to distinguish a photo of a dog from a photo of a cat, you can provide it 
with good examples of dogs and cats. You then train a dog versus cat classifier 
based on some machine learning algorithm that could output the probability that 
a given photo is a dog or a cat. Of course, the output is a probability — not an 
absolute assurance that the photo is a dog or cat. 


Based on the probability that the classifier reports, you can decide the class (dog or 
cat) of a photo based on the estimated probability calculated by the algorithm. 
When the probability is higher for a dog, you can minimize the risk of making a 
wrong assessment by choosing the higher chances favoring a dog. The greater the 
probability difference between the likelihood of a dog against that of a cat, the 
higher the confidence you can have in your choice. A close choice likely occurs 
because of some ambiguity in the photo (the photo is not clear or the dog is actually 
a bit cattish). For that matter, it might not even be a dog — and the algorithm 
doesn’t know anything about the raccoon, which is what the picture actually shows. 
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Such is the power of training a classifier: You pose the problem; you offer the 
examples, with each one carefully marked with the label or class that the algo- 
rithm should learn; your computer trains the algorithm for a while; and finally 
you get a resulting model, which provides you with an answer or probability. 
(Labeling is a challenging activity in itself, as you discover in the chapters that 
follow.) In the end, a probability is just an opportunity (or a risk, from another 
perspective) to propose a solution and get a correct answer. At this point, you may 
seem to have addressed every issue and believe that the work is finished, but you 
must still validate the results. This chapter helps you discover why machine learn- 
ing isn’t just a push-the-button-and-forget-it activity. 


Checking Out-of-Sample Errors 
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REMEMBER 


When you first receive the data used to train the algorithm, the data is just a data 
sample. Unless the circumstances are quite rare, the data you receive won’t be all 
the data that you could possibly get. For instance, if you receive sales data from 
your marketing department, the data you receive is not all the possible sales data 
because unless sales are stopped, there will always be new data representing new 
sales in the future. 


If your data is not all the data possible, you must call it a sample. A sample is a 
selection, and as with all selections, the data could reflect different motivations as 
to why someone selected it in such a way. Therefore, when you receive data, the 
first question you have to consider is how someone has selected it. If someone 
selected it randomly, without any specific criteria, you can expect that, if things 
do not change from the past, future data won’t differ too much from the data you 
have at hand. 


Statistics expect that the future won’t differ too much from the past. Thus, you 
can base future predictions on past data by employing random sampling theory. If 
you select examples randomly without a criterion, you do have a good chance of 
choosing a selection of examples that won’t differ much from future examples, or, 
in statistical terms, you can expect that the distribution of your present sample 
will closely resemble the distribution of future samples. 


However, when the sample you receive is somehow special, it could present a 
problem when training the algorithm. In fact, the special data could force your 
algorithm to learn a different mapping to the response than the mapping it might 
have created by using random data. As an example, if you receive sales data from 
just one shop or only the shops in a single region (which is actually a specific 
sample), the algorithm may not learn how to forecast the future sales of all the 
shops in all the regions. The specific sample causes problems because other shops 
may be different and follow different rules from the ones you’re observing. 
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REMEMBER 


Ensuring that your algorithm is learning correctly from data is the reason you 
should always check what the algorithm has learned from in-sample data (the data 
used for training) by testing your hypothesis on some out-of-sample data. 
Out-of-sample data is data you didn’t have at learning time, and it should repre- 
sent the kind of data you need to create forecasts. 


Looking for generalization 


Generalization is the capability to learn from data at hand the general rules that 
you can apply to all other data. Out-of-sample data therefore becomes essential to 
figuring out whether learning from data is possible, and to what extent. 


No matter how big your in-sample dataset is, bias created by some selection cri- 
teria still makes seeing similar examples frequently and systematically highly 
unlikely in reality. For example, in statistics, there is an anecdote about inferring 
from biased samples. It involves the 1936 U.S. presidential election between Alfred 
Landon and Franklin D. Roosevelt in which the Literary Digest used biased poll 
information to predict the winner. 


At that time, the Literary Digest, a respectable and popular magazine, polled its 
readers to determine the next president of the United States, a practice that it had 
performed successfully since 1916. The response of the poll was strikingly in favor 
of Landon, with more than a 57 percent consensus on the candidate. The maga- 
zine also used such a huge sample — more than 10 million people (with only 
2.4 million responding) — that the result seemed unassailable: A large sample 
coupled with a large difference between the winner and the loser tends not to raise 
many doubts. Yet the poll was completely unsuccessful. In the end, the margin of 
error was 19 percent, with Landon getting only 38 percent of the vote and Roos- 
evelt getting 62 percent. This margin is the largest error ever for a public opinion 
poll. 


What happened? Well, simply put, the magazine questioned people whose names 
were pulled from every telephone directory in United States, as well as from the 
magazine’s subscription list and from rosters of clubs and associations, gathering 
more than ten million names. Impressive, but at the end of the Great Depression, 
having a telephone, subscribing to a magazine, or being part of a club meant that 
you were rich, so the sample was made of only affluent voters and completely 
ignored lower-income voters, who happen to represent the majority (thereby 
resulting in a selection bias). In addition, the poll suffered from a nonresponsive bias 
because only 2.4 million people responded, and people who respond to polls tend 
to differ from those who don’t. (You can read more about the faulty Literary Digest 
poll at https: //www.math.upenn.edu/*deturck/m170/wk4/lecture/caset .htm1.) 
The magnitude of error for this particular incident ushered in the beginning of a 
more scientific approach to sampling. 
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REMEMBER 


Such classical examples of selection bias point out that if the selection process 
biases a sample, the learning process will have the same bias. However, sometimes 
bias is unavoidable and difficult to spot. As an example, when you go fishing with 
a net, you can see only the fish you catch and that didn’t pass through the net itself. 


Another example comes from World War II. At that time, designers constantly 
improved U.S. war planes by adding extra armor plating to the parts that took the 
most hits upon returning from bombing runs. It took the reasoning of the math- 
ematician Abraham Wald to point out that designers actually needed to reinforce 
the places that didn’t have bullet holes on returning planes. These locations were 
likely so critical that a plane hit there didn’t return home, and consequently no 
one could observe its damage (a kind of survivorship bias where the survivors skew 
the data). Survivorship bias is still a problem today. In fact, you can read about 
how this story has shaped the design of Facebook at http://www. fastcodesign. 
com/1671172/how-a-story-from-—wor 1d-—war-ii-shapes-—facebook-today. 





Preliminary reasoning on your data and testing results with out-of-sample 
examples can help you spot or at least have an intuition of possible sampling 
problems. However, receiving new out-of-sample data is often difficult, costly, 
and requires investment in terms of timing. In the sales example discussed ear- 
lier, you have to wait for a long time to test your sales forecasting model — maybe 
an entire year — in order to find out whether your hypothesis works. In addition, 
making the data ready for use can consume a great deal of time. For example, 
when you label photos of dogs and cats, you need to spend time labeling a larger 
number of photos taken from the web or from a database. 


A possible shortcut to expending additional effort is getting out-of-sample exam- 
ples from your available data sample. You reserve a part of the data sample based 
on a separation between training and testing data dictated by time or by random 
sampling. If time is an important component in your problem (as it is in forecast- 
ing sales), you look for a time label to use as separator. Data before a certain date 
appears as in-sample data; data after that date appears as out-of-sample data. 
The same happens when you choose data randomly: What you extracted as in- 
sample data is just for training; what is left is devoted to testing purposes and 
serves as your out-of-sample data. 


Getting to Know the Limits of Bias 
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Now that you know more about the in-sample and out-of-sample portions of your 
data, you also know that learning depends a lot on the in-sample data. This portion 
of your data is important because you want to discover a point of view of the world, 
and as with all points of view, it can be wrong, distorted, or just merely partial. You 
also know that you need an out-of-sample example to check whether the learning 
process is working. However, these aspects form only part of the picture. When you 
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make a machine learning algorithm work on data in order to guess a certain 
response, you are effectively taking a gamble, and that gamble is not just because of 
the sample you use for learning. There’s more. For the moment, imagine that you 
freely have access to suitable, unbiased, in-sample data, so data is not the problem. 
Instead you need to concentrate on the method for learning and predicting. 


First, you must consider that you’re betting that the algorithm can reasonably 
guess the response. You can’t always make this assumption because figuring out 
certain answers isn’t possible no matter what you know in advance. For instance, 
you can’t fully determine the behavior of human beings by knowing their previous 
history and behavior. Maybe a random effect is involved in the generative process 
of our behavior (the irrational part of us, for instance), or maybe the issue comes 
down to free will (the problem is also a philosophical/religious one, and there are 
many discordant opinions). Consequently, you can guess only some types of 
responses, and for many others, such as when you try to predict people’s behav- 
ior, you have to accept a certain degree of uncertainty (which, with luck, is accept- 
able for your purposes). 


Second, you must consider that you’re betting that the relationship between the 
information you have and the response you want to predict can be expressed as a 
mathematical formula of some kind, and that your machine learning algorithm is 
actually capable of guessing that formula. The capacity of your algorithm to guess 
the mathematical formula behind a response is intrinsically embedded in the nuts 
and bolts of the algorithm. Some algorithms can guess almost everything; others 
actually have a limited set of options. The range of possible mathematical formu- 
lations that an algorithm can guess is the set of its possible hypothesis. Conse- 
quently, a hypothesis is a single algorithm, specified in all its parameters and 
therefore capable of a single, specific formulation. 


Mathematics is fantastic. It can describe much of the real world by using some 
simple notation, and it’s the core of machine learning because any learning algo- 
rithm has a certain capability to represent a mathematical formulation. Some 
algorithms, such as linear regression, explicitly use a specific mathematical for- 
mulation for representing how a response (for instance, the price of a house) 
relates to a set of predictive information (such as market information, house loca- 
tion, surface of the estate, and so on). 


Some formulations are so complex and intricate that even though representing 
them on paper is possible, doing so is too difficult in practical terms. Some other 
sophisticated algorithms, such as decision trees (a topic of the next chapter), don’t 
have an explicit mathematical formulation, but are so adaptable that they can be set 
to approximate a large range of formulations easily. As an example, consider a sim- 
ple and easily explained formulation. The linear regression is just a line in a space 
of coordinates given by the response and all the predictors. In the easiest example, 
you can have a response, y, and a single predictor, x, with a formulation of 


y =Bix1 +o 
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FIGURE 11-1: 


Example of a 
linear model 


strug 


gling to map 


a curve function. 


REMEMBER 


In a simple situation of a response predicted by a single feature, such a model is 
perfect when your data arranges itself as a line. However, what happens if it 
doesn’t and instead shapes itself like a curve? To represent the situation, just 
observe the following bidimensional representations, as shown in Figure 11-1. 


Undefitting / high bias 





response y 
response y 


—— Model 
—- True function 


0.0 0.2 04 0.6 08 10 
feature x 


When points resemble a line or a cloud, some error occurs when you’re figuring 
out that the result is a straight line; therefore, the mapping provided by the pre- 
ceding formulation is somehow imprecise. However, the error doesn’t appear sys- 
tematically but rather randomly because some points are above the mapped line 
and others are below it. The situation with the curved, shaped cloud of points is 
different, because this time, the line is sometimes exact but at other times is sys- 
tematically wrong. Sometimes points are always above the line; sometimes they 
are below it. 


Given the simplicity of its mapping of the response, your algorithm tends to sys- 
tematically overestimate or underestimate the real rules behind the data, repre- 
senting its bias. The bias is characteristic of simpler algorithms that can’t express 
complex mathematical formulations. 


Keeping Model Complexity in Mind 
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Just as simplicity of formulations is a problem, automatically resorting to map- 
ping very intricate formulations doesn’t always provide a solution. In fact, you 
don’t know the true complexity of the required response mapping (such as whether 


PART 3 Getting Started with the Math Basics 


FIGURE 11-2: 
Example of a 
linear model 
going right and 
becoming too 
complex while 
trying to map a 
curve function. 


it fits in a straight line or in a curved one). Therefore, just as simplicity may create 
an unsuitable response (refer to Figure 11-1), it’s also possible to represent the 
complexity in data with an overly complex mapping. In such case, the problem 
with a complex mapping is that it has many terms and parameters — and in some 
extreme cases, your algorithm may have more parameters than your data has 
examples. Because you must specify all the parameters, the algorithm then starts 
memorizing everything in the data — not just the signals but also the random 
noise, the errors, and all the slightly specific characteristics of your sample. 


In some cases, it can even just memorize the examples as they are. However, 
unless you’re working on a problem with a limited number of simple features with 
few distinct values (basically a toy dataset, that is, a dataset with few examples 
and features, thus simple to deal with and ideal for examples), you’re highly 
unlikely to encounter the same example twice, given the enormous number of 
possible combinations of all the available features in the dataset. 


When memorization happens, you may have the illusion that everything is work- 
ing well because your machine learning algorithm seems to have fitted the in- 
sample data so well. Instead, problems can quickly become evident when you start 
having it work with out-of-sample data and you notice that it produces errors in 
its predictions as well as errors that actually change a lot when you relearn from 
the same data with a slightly different approach. Overfitting occurs when your 
algorithm has learned too much from your data, up to the point of mapping curve 
shapes and rules that do not exist, as shown in Figure 11-2. Any slight change in 
the procedure or in the training data produces erratic predictions. 


Best fitting Overfitting / high variance 





response y 


—— Model 
-- True function 


0.0 0.2 04 0.6 08 10 0.0 0.2 04 0.6 08 10 
feature x feature x 
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Keeping Solutions Balanced 
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REMEMBER 


To create great solutions, machine learning models trade off between simplicity 
(implying a higher bias) and complexity (generating a higher variance of esti- 
mates). If you intend to achieve the best predictive performance, you do need to 
find a solution in the middle by understanding what works better, which you do 
by using trial and error on your data. Because data is what dictates the most suit- 
able solution for the prediction problem, you have neither a panacea nor an easy 
recurrent solution for solving all your machine learning dilemmas. 


A commonly referred theorem in the mathematical folklore is the no-free-lunch 
theorem by David Wolpert and William Macready, which states that “any two 
optimization algorithms are equivalent when their performance is averaged across 
all possible problems” (see https: //en.wikipedia.org/wiki/No_free_lunch_ 
theorem for details). If the algorithms are equivalent in the abstract, no one is 
superior to the other unless proved in a specific, practical problem. (See the dis- 
cussion at http: //www.no-free-lunch.org/ for more details about no-free- 
lunch theorems; two of them are actually used for machine learning.) 


In particular, in his article “The Lack of A Priori Distinctions between Learning 
Algorithms,” Wolpert discussed the fact that there are no a priori distinctions 
between algorithms, no matter how simple or complex they are (you can obtain 
the article at http: //citeseerx.ist.psu.edu/viewdoc/summary ?doi=1@.1.1.51. 
9734). Data dictates what works and how well it works. In the end, you cannot 
always rely on a single machine learning algorithm, but you have to test many and 
find the best one for your problem. 


Besides being led into machine learning experimentation by the try-everything 
principle of the no-free-lunch theorem, you have another rule of thumb to con- 
sider: Occam’s razor, which is attributed to William of Occam, a scholastic philoso- 
pher and theologian who lived in the fourteenth century (see http: //math.ucr. 
edu/home/baez/physics/General/occam.html for details). The Occam’s razor 
principle states that theories should be cut down to the minimum in order to plau- 
sibly represent the truth (hence the razor). The principle doesn’t state that simpler 
solutions are better, but that, between a simple solution and a more complex solu- 
tion offering the same result, the simpler solution is always preferred. The princi- 
ple is at the very foundations of our modern scientific methodology, and even 
Albert Einstein seems to have often referred to it, stating that “everything should 
be as simple as it can be, but not simpler” (see http: //quoteinvestigator. 
com/2011/05/13/einstein-simple/ for details). Summarizing the evidence so far: 


>> To get the best machine learning solution, try everything you can on your data 
and represent your data’s performance with learning curves. 


PART 3 Getting Started with the Math Basics 





REMEMBER 


>> Start with simpler models, such as linear models, and always prefer a simpler 
solution when it performs nearly as well as a complex solution. You benefit 
from the choice when working on out-of-sample data from the real world. 


>» Always check the performance of your solution using out-of-sample examples, 
as discussed in the preceding sections. 


Depicting learning curves 


To visualize the degree to which a machine learning algorithm is suffering from 
bias or variance with respect to a data problem, you can take advantage of a chart 
type named learning curve. Learning curves are displays in which you plot the per- 
formance of one or more machine learning algorithms with respect to the quan- 
tity of data they use for training. The plotted values are the prediction error 
measurements, and the metric is measured both as in-sample and cross-validated 
or out-of-sample performance. 


If the chart depicts performance with respect to the quantity of data, it’s a learn- 
ing curve chart. When it depicts performance with respect to different hyper- 
parameters or a set of learned features picked by the model, it’s a validation curve 
chart instead. To create a learning curve chart, you must do the following: 


>? Divide your data into in-sample and out-of-sample sets (a train/test split of 
70/30 works fine, or you can use cross-validation). 


>» Create portions of your training data of growing size. Depending on the size of 
the data that you have available for training, you can use 10 percent portions 
or, if you have a lot of data, grow the number of examples on a power scale 
such as 103, 104, 10° and so on. 


>> Train models on the different subsets of the data. Test and record their 
performance on the same training data and on the out-of-sample set. 


>> Plot the recorded results on two curves, one for the in-sample results and the 
other for the out-of-sample results (see Figure 11-3). If instead of a train/test 
split you used cross-validation, you can also draw boundaries expressing the 
stability of the result across multiple validations (confidence intervals) based 
on the standard deviation of the results themselves. 


Ideally, you should obtain two curves with different starting error points: higher 
for the out-of-sample; lower for the in-sample. As the size of the training set 
increases, the difference in space between the two should reduce until, at a certain 
number of observations, they become close to a common error value. 
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FIGURE 11-3: 
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Noticeably, after you print your chart, problems arise when 


>> The two curves tend to converge, but you can't see on the chart that 
they get near each other because you have too few examples. This 
situation gives you a strong hint to increase the size of your dataset if you 
want to successfully learn with the tested machine learning algorithm. 


>> The final convergence point between the two curves has a high error, so 
consequently your algorithm has too much bias. Adding more examples 
here does not help because you have a convergence with the amount of data 
you have. You should increase the number of features or use a more complex 
learning algorithm as a solution. 


>> The two curves do not tend to converge because the out-of-sample curve 
starts to behave erratically. Such a situation is clearly a sign of high variance 
of the estimates, which you can reduce by increasing the number of examples 
(at a certain number, the out-of-sample error will start to decrease again), 
reducing the number of features, or, sometimes, just fixing some key param- 
eters of the learning algorithm. 


R doesn’t provide an implementation of learning curves (so you need to program 
one of your own). Python provides learning curves as part of the Scikit-learn 
package using the learning_curve function that prepares all the computations for 
you (see the details at http: //scikit-learn.org/stable/modules/generated/ 
sklearn. learning_curve. learning_curve.htm1). The example in the book con- 
veniently wraps the function to perform the plot of the results as well because the 
function outputs just a sequence of vectors of results. 
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Training, Validating, and Testing 


TIP 


In a perfect world, you could perform a test on data that your machine learning 
algorithm has never learned from before. However, waiting for fresh data isn’t 
always feasible in terms of time and costs. As a first simple remedy, you can ran- 
domly split your data into training and test sets. The common split is from 25 to 
30 percent for testing and the remaining 75 to 70 percent for training. You split 
your data consisting of your response and features at the same time, keeping cor- 
respondence between each response and its features. 


The second remedy occurs when you need to tune your learning algorithm. In this 
case, the test split data isn’t a good practice because it causes another kind of 
overfitting called snooping (see more on this topic later in the chapter). To over- 
come snooping, you need a third split, called a validation set. A suggested split is 
to have your examples partitioned in thirds: 70 percent for training, 20 percent for 
validation, and 10 percent for testing. 


You should perform the split randomly, that is, regardless of the initial ordering 
of the data. Otherwise, your test won’t be reliable, because ordering could cause 
overestimation (when there is some meaningful ordering) or underestimation (when 
distribution differs by too much). As a solution, you must ensure that the test set 
distribution isn’t very different from the training distribution, and that sequential 
ordering occurs in the split data. For example, check whether identification num- 
bers, when available, are continuous in your sets. Sometimes, even if you strictly 
abide by random sampling, you can’t always obtain similar distributions among 
sets, especially when your number of examples is small. 


When your number of examples n is high, such as n>10,000, you can quite confi- 
dently create a randomly split dataset. When the dataset is smaller, comparing 
basic statistics such as mean, mode, median, and variance across the response and 
features in the training and test sets will help you understand whether the test set 
is unsuitable. When you aren’t sure that the split is right, just recalculate a new one. 


Resorting to Cross-Validation 


A noticeable problem with the train/test set split is that you’re actually introduc- 
ing bias into your testing because you’re reducing the size of your in-sample 
training data. When you split your data, you may be actually keeping some useful 
examples out of training. Moreover, sometimes your data is so complex that a test 
set, though apparently similar to the training set, is not really similar because 
combinations of values are different (which is typical of highly dimensional data- 
sets). These issues add to the instability of sampling results when you don’t have 
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FIGURE 11-4: 
A graphical 


representation 


of how 


cross-validation 
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works. 


many examples. The risk of splitting your data in an unfavorable way also explains 
why the train/test split isn’t the favored solution by machine learning practitio- 
ners when you have to evaluate and tune a machine learning solution. 


Cross-validation based on k-folds is actually the answer. It relies on random split- 
ting, but this time it splits your data into a number k of folds (portions of your data) 
of equal size. Then, each fold is held out in turn as a test set and the others are used 
for training. Each iteration uses a different fold as a test, which produces an error 
estimate. In fact, after completing the test on one fold against the others used as 
training, a successive fold, different from the previous, is held out and the procedure 
is repeated in order to produce another error estimate. The process continues until 
all the k-folds are used once as a test set and you have a k number of error estimates 
that you can compute into a mean error estimate (the cross-validation score) and a 
standard error of the estimates. Figure 11-4 shows how this process works. 


FOLD1 FOLD 2 FOLD 3 FOLD 4 FOLD5 


ITERATION 5 TEST TRAIN TRAIN TRAIN TRAIN 
DATASET PARTITIONED INTO FOLDS 













This procedure provides the following advantages: 


>> It works well regardless of the number of examples, because by increasing the 
number of used folds, you are actually increasing the size of your training set 
(larger k, larger training set, reduced bias) and decreasing the size of the test set. 


> Differences in distribution for individual folds don’t matter as much. When a fold 
has a different distribution compared to the others, it's used just once as a test set 
and is blended with others as part of the training set during the remaining tests. 


>> You are actually testing all the observations, so you are fully testing your 
machine learning hypothesis using all the data you have. 


>» By taking the mean of the results, you can expect a predictive performance. In 
addition, the standard deviation of the results can tell you how much variation 
you can expect in real out-of-sample data. Higher variation in the cross-validated 
performances informs you of extremely variegated data that the algorithm is 
incapable of properly catching. 
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Using k-fold cross-validation is always the optimal choice unless the data you’re 
using has some kind of order that matters. For instance, it could involve a time 
series, such as sales. In that case, you shouldn’t use a random sampling method 
but instead rely on a train/test split based on the original sequence so that the 
order is preserved and you can test on the last examples of that ordered series. 


Looking for Alternatives in Validation 





REMEMBER 


You have a few alternatives to cross-validation, all of which are derived from sta- 
tistics. The first one to consider — but only if you have an in-sample made of few 
examples — is the leave-one-out cross-validation (LOOCV). It is analogous to 
k-folds cross-validation, with the only difference being that k, the number of 
folds, is exactly n, the number of examples. Therefore, in LOOCV, you build n mod- 
els (which may turn into a huge number when you have many observations) and 
test each one on a single out-of-sample observation. Apart from being computa- 
tional intensive and requiring that you build many models to test your hypothesis, 
the problem with LOOCV is that it tends to be pessimistic (making your error esti- 
mate higher). It’s also unstable for a small number of n, and the variance of the 
error is much higher. All these drawbacks make comparing models difficult. 


Another alternative from statistics is bootstrapping, a method long used to esti- 
mate the sampling distribution of statistics, which are presumed not to follow a 
previously assumed distribution. Bootstrapping works by building a number (the 
more the better) of samples of size n (the original in-sample size) drawn with 
repetition. To draw with repetition means that the process could draw an example 
multiple times to use it as part of the bootstrapping resampling. Bootstrapping 
has the advantage of offering a simple and effective way to estimate the true error 
measure. In fact, bootstrapped error measurements usually have much less vari- 
ance than cross-validation ones. On the other hand, validation becomes more 
complicated due to the sampling with replacement, so your validation sample 
comes from the out-of-bootstrap examples. Moreover, using some training sam- 
ples repeatedly can lead to a certain bias in the models built with bootstrapping. 


If you are using out-of-bootstrapping examples for your test, you’ll notice that 
the test sample can be of various sizes, depending on the number of unique exam- 
ples in the in-sample, likely accounting for about a third of your original in- 
sample size. This simple Python code snippet demonstrates randomly simulating 
a certain number of bootstraps: 


from random import randint 


import numpy as np 
n = 1000 # number of examples 


CHAPTER 11 Validating Machine Learning 193 


# your original set of examples 
examples = set(range(n) ) 
results = list() 
for j in range(1@000): 
# your boostrapped sample 
chosen = [randint(@,n) for k in range(n)] 
# out-of-sample 
results. append( (1@@0-len(set(choosen)&examples ) ) 
/float(n) ) 
print ("Out-of-boostrap: %0.1f %%" % 
(np.mean(results)*1@@) ) 


Out-of-boostrap: 36.8 % 


Running the experiment may require some time, and your results may be differ- 
ent due to the random nature of the experiment. However, you should see an 
output of around 36.8 percent. 


Optimizing Cross-Validation Choices 


Being able to validate a machine learning hypothesis effectively allows further 
optimization of your chosen algorithm. As discussed in the previous sections, the 
algorithm provides most of the predictive performance on your data, given its abil- 
ity to detect signals from data and fit the true functional form of the predictive 
function without overfitting and generating much variance of the estimates. Not 
every machine learning algorithm is a best fit for your data, and no single algorithm 
can suit every problem. It’s up to you to find the right one for a specific problem. 


A second source of predictive performance is the data itself when appropriately 
transformed and selected to enhance the learning capabilities of the chosen algo- 
rithm. Chapter 13, on data preprocessing, discusses this issue of transforming and 
selecting data. 


The final source of performance derives from fine-tuning the algorithm’s hyper- 
parameters, which are the parameters that you decide before learning happens and 
that aren’t learned from data. Their role is in defining a priori a hypothesis, 
whereas other parameters specify it a posteriori, after the algorithm interacts with 
the data and, by using an optimization process, finds that certain parameter val- 
ues work better in obtaining good predictions. Not all machine learning algo- 
rithms require much hyper-parameter tuning, but some of the most complex 
ones do, and though such algorithms still work out of the box, pulling the right 
levers may make a large difference in the correctness of the predictions. Even 
when the hyper-parameters aren’t learned from data, you should consider the 
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data you’re working on when deciding hyper-parameters, and you should make 
the choice based on cross-validation and careful evaluation of possibilities. 


Complex machine learning algorithms, the ones most exposed to variance of 
estimates, present many choices expressed in a large number of parameters. 
Twiddling with them makes them adapt more or less to the data they are learning 
from. Sometimes too much hyper-parameter twiddling may even make the algo- 
rithm detect false signals from the data. That makes hyper-parameters them- 
selves an undetected source of variance if you start manipulating them too much 
based on some fixed reference like a test set or a repeated cross-validation schema. 


Both R and Python offer slicing functionalities that slice your input matrix into 
train, test, and validation parts. In particular, for more complex testing proce- 
dures, such as cross-validation or bootstrapping, the Scikit-learn package offers 
an entire module (http: //scikit-learn.org/stable/modules/classes.html# 
module-sklearn.cross_validation), and R has a specialized package, offering 
functions for data splitting, preprocessing, and testing. This package is called 
caret (http: //topepo. github. io/caret/index.htm1). In Part 5 of the book, you 
discover how to apply machine learning to real problems, including some practi- 
cal examples using both these packages. 


Exploring the space of hyper-parameters 


The possible combinations of values that hyper-parameters may form make 
deciding where to look for optimizations hard. As described when discussing gra- 
dient descent, an optimization space may contain value combinations that per- 
form better or worse. Even after you find a good combination, you’re not assured 
that it’s the best option. (This is the problem of getting stuck in local minima 
when minimizing the error, an issue described in Chapter 10 when talking about 
gradient descent’s problems.) 


As a practical way of solving this problem, the best way to verify hyper-parameters 
for an algorithm applied to specific data is to test them all by cross-validation, 
and to pick the best combination. This simple approach, called grid-search, offers 
indisputable advantages by allowing you to sample the range of possible values to 
input into the algorithm systematically and to spot when the general minimum 
happens. On the other hand, grid-search also has serious drawbacks because it’s 
computationally intensive (you can easily perform this task in parallel on modern 
multicore computers) and quite time consuming. Moreover, systematic and inten- 
sive tests enhance the possibility of incurring error because some good but fake 
validation results can be caused by noise present in the dataset. 


Some alternatives to grid-search are available. Instead of testing everything, you 
can try exploring the space of possible hyper-parameter values guided by compu- 


tationally heavy and mathematically complex nonlinear optimization techniques 
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FIGURE 11-5: 
Comparing 
grid-search to 
random search. 


TIP 


(like the Nelder-Mead method), using a Bayesian approach (where the number of 
tests is minimized by taking advantage of previous results) or using random 
search. 


Surprisingly, random search works incredibly well, is simple to understand, and 
isn’t just based on blind luck, though it may initially appear to be. In fact, the 
main point of the technique is that if you pick enough random tests, you actually 
have enough possibilities to spot the right parameters without wasting energy on 
testing slightly different combinations of similarly performing combinations. 


The graphical representation shown in Figure 11-5 explains why random search 
works fine. A systematic exploration, though useful, tends to test every combina- 
tion, which turns into a waste of energy if some parameters don’t influence the 
result. A random search actually tests fewer combinations but more in the range 
of each hyper-parameter, a strategy that proves winning if, as often happens, 
certain parameters are more important than others. 


@ Exaustive grid 
M Random layout 


hyperparameter gamma 





0.0 0.2 04 0.6 08 10 
hyperparameter alpha 


For randomized search to perform well, you should make from 15 to a maximum 
of 60 tests. It does make sense to resort to random search if a grid-search requires 
a larger number of experiments. 


Avoiding Sample Bias and Leakage Traps 


One final aspect to include in this chapter’s overview of the validation approach to 
machine learning is an examination of a possible remedy to in-sampling bias. In- 
sampling bias can happen to your data before machine learning is put into action, 
and it causes high variance of the following estimates. In addition, this section 
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provides a warning about leakage traps that can occur when some information 
from the out-of-sample passes to in-sample data. This issue can arise when you 
prepare the data or after your machine learning model is ready and working. 


The remedy, which is called ensembling of predictors, works perfectly when your 
training sample is not completely distorted and its distribution is different from 
the out-of-sample, but not in an irremediable way, such as when all your classes 
are present but not in the right proportion (as an example). In such cases, your 
results are affected by a certain variance of the estimates that you can possibly 
stabilize in one of several ways: by resampling, as in bootstrapping; by subsam- 
pling (taking a sample of the sample); or by using smaller samples (which 
increases bias). 


To understand how ensembling works so effectively, visualize the image of a 
bull’s eye. If your sample is affecting the predictions, some predictions will be 
exact and others will be wrong in a random way. If you change your sample, the 
right predictions will keep on being right, but the wrong ones will start being 
variations between different values. Some values will be the exact prediction you 
are looking for; others will just oscillate around the right one. 


By comparing the results, you can guess that what is recurring is the right answer. 
You can also take an average of the answers and guess that the right answer 
should be in the middle of the values. With the bull’s-eye game, you can visualize 
superimposing photos of different games: If the problem is variance, ultimately 
you will guess that the target is in the most frequently hit area or at least at the 
center of all the shots. 


In most cases, such an approach proves to be correct and improves your machine 
learning predictions a lot. When your problem is bias and not variance, using 
ensembling really doesn’t cause harm unless you subsample too few samples. A 
good rule of thumb for subsampling is to take a sample from 70 to 90 percent 
compared to the original in-sample data. If you want to make ensembling work, 
you should do the following: 


1. iterate a large number of times through your data and models (from just a 
minimum of three iterations to ideally hundreds of times of them). 
2. Every time you iterate, subsample (or else bootstrap) your in-sample data. 


3. Use machine learning for the model on the resampled data, and predict the 
out-of-sample results. Store those results away for later use. 


4. atthe end of the iterations, for every out-of-sample case you want to predict, 
take all its predictions and average them if you are doing a regression. Take the 
most frequent class if you are doing a classification. 
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Watching out for snooping 


Leakage traps can surprise you because they can prove to be an unknown and 
undetected source of problems with your machine learning processes. The prob- 
lem is snooping, or otherwise observing the out-of-sample data too much and 
adapting to it too often. In short, snooping is a kind of overfitting — and not just 
on the training data but also on the test data, making the overfitting problem 
itself harder to detect until you get fresh data. Usually you realize that the problem 
is snooping when you already have applied the machine learning algorithm to 
your business or to a service for the public, making the problem an issue that 
everyone can see. 


You can avoid snooping in two ways. First, when operating on the data, take care 
to neatly separate training, validation, and test data. Also, when processing, never 
take any information from validation or test, even the most simple and innocent- 
looking examples. Worse still is to apply a complex transformation using all the 
data. In finance, for instance, it is well known that calculating the mean and the 
standard deviation (which can actually tell you a lot about market conditions and 
risk) from all training and testing data can leak precious information about your 
models. When leakage happens, machine learning algorithms perform predictions 
on the test set rather than the out-of-sample data from the markets, which means 
that they didn’t work at all, thereby causing a loss of money. 


Check the performance of your out-of-sample examples. In fact, you may bring 
back some information from your snooping on the test results to help you deter- 
mine that certain parameters are better than others, or lead you to choose one 
machine learning algorithm instead of another. For every model or parameter, 
apply your choice based on cross-validation results or from the validation 
sample. Never fall for getting takeaways from your out-of-sample data or you’ll 
regret it later. 
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IN THIS CHAPTER 


Trying a perceptron to separate 
classes by a line 


Partitioning recursively training data 
by decision trees 


Discovering the rules behind playing 
tennis and surviving the Titanic 


Leveraging Bayesian probability to 
analyze textual data 


Starting with Simple 
Learners 


eginning with this chapter, the examples start illustrating the basics of how 

to learn from data. The plan is to touch some of the simplest learning strat- 

egies first — providing some formulas (just those that are essential), intu- 
itions about their functioning, and examples in R and Python for experimenting 
with some of their most typical characteristics. The chapter begins by reviewing 
the use of the perceptron to separate classes. 


At the root of all principal machine learning techniques presented in the book, 
there is always an algorithm based on somewhat interrelated linear combinations, 
variations of the sample splitting of decision trees, or some kind of Bayesian 
probabilistic reasoning. This chapter uses classification trees to demonstrate the 
technique. The only exception is the K-Nearest Neighbors (KNN) algorithm, 
which, based on analogical reasoning, is treated apart in a special chapter devoted 
to detection of similarity in data (Chapter 14). 


Getting a grasp on these basic techniques means being able to deal with more 
complex learning techniques later and being able to understand (and use) them 
better. It may appear incredible now, but you can create some of the most effective 
algorithms using ensembles of the simplest algorithms — those viewed as weak 
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learners. In this chapter, you see how to use various techniques to predict when 
playing tennis is appropriate based on weather conditions. 


At the end of the journey, no algorithm will appear as a black box anymore. 
Machine learning has a strong intuitive and human component because it is a 
human creation (at least for the moment, unless any singularity appears), and it 
is based on analogies of how we learn from the world or on the imitation of nature 
(for instance, on how we know the brain works). If core ideas of the discipline are 
conveyed, no algorithm is really too difficult to understand. The chapter demon- 
strates this approach using Bayesian probability to analyze text samples. 


Discovering the Incredible Perceptron 
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You can start the journey toward discovering how machine learning algorithms 
work by looking at models that figure out their answers using lines and surfaces 
to divide examples into classes or to estimate value predictions. These are linear 
models, and this chapter presents one of the earliest linear algorithms used in 
machine learning: the perceptron. Later chapters complete the overview of linear 
models. Chapter 15 introduces linear regression and its family of statistically 
derived algorithms; Chapter 16 helps you discover neural networks; and Chapter 17 
talks about support vector machines. However, before you can advance to these 
other topics, you should understand the interesting history of the perceptron. 


Falling short of a miracle 


Frank Rosenblatt at the Cornell Aeronautical Laboratory devised the perceptron in 
1957 under the sponsorship of the United States Naval Research. Rosenblatt was a 
psychologist and pioneer in the field of artificial intelligence. Proficient in cogni- 
tive science, it was his idea to create a computer that could learn by trial and error, 
just as a human does. 


The idea was successfully developed, and at the beginning, the perceptron wasn’t 
conceived as just a piece of software; it was created as software running on dedi- 
cated hardware. The use of the combination allowed faster and more precise 
recognition of complex images than any other computer could do at the time. The 
new technology raised great expectations and caused a huge controversy when 
Rosenblatt affirmed that the perceptron was the embryo of a new kind of computer 
that would be able to walk, talk, see, write, and even reproduce itself and be 
conscious of its existence. If true, it would have been a powerful tool, and it 
introduced the world to AI. 
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Needless to say, perceptron didn’t realize the expectations of its creator. It soon 
displayed a limited capacity, even in its image-recognition specialization. The 
general disappointment ignited the first AI Winter and the temporary abandon- 
ment of connectionism until the 1980s. 


Connectionism is the approach to machine learning that is based on neuroscience 
as well as the example of biologically interconnected networks. You can retrace 
the root of connectionism to the perceptron. (See the “Specifying the Role of Sta- 
tistics in Machine Learning” section of Chapter 2 for a discussion of the five tribes 
of machine learning.) 


The perceptron is an iterative algorithm that strives to determine, by successive 
and reiterative approximations, the best set of values for a vector, w, which is also 
called the coefficient vector. Vector w can help predict the class of an example when 
you multiply it by the matrix of features, X (containing the information in numeric 
values) and then add it to a constant term, called the bias. The output is a predic- 
tion in the sense that the previously described operations output a number whose 
sign should be able to predict the class of each example exactly. 


The natural specialty of the perceptron is binary classification. However, you can 
use it to predict multiple classes using more models (one for each class guess, a 
training strategy called one-versus-all or OVA). Apart from classification, the 
perceptron can’t provide much more information. For instance, you can’t use it to 
estimate the probability of being exact in predictions. In mathematical terms, the 
perceptron tries to minimize the following cost function formulation, but it does 
so only for the examples that are misclassified (in other words, whose sign doesn’t 
correspond to the right class): 


Error =-} icm y;(x/w +b) 


The formula, which is an example of a cost function as defined in Chapter 10, 
involves only the examples of the matrix X, which, under the current set of w, 
have been assigned a misclassified sign. To understand the function of the 
formula, you have to consider that there are only two classes. The examples from 
the first class are expressed as a +1 value in response to vector y, whereas the 
examples from the other class are consequently coded as —1. The misclassified 
examples are deemed part of the set M (summations consider just the i-th exam- 
ple as being part of M). The formula picks up each of the misclassified examples 
and, in turn, multiplies their features by the vector w, and then adds the bias. 


Because the multiplication of misclassified examples, X, by the weight vector, w, 
within the parenthesis is a multiplication of vectors, you should transpose the x 
vector of features of an example so that the result of the multiplication with w is 
a number. This is an aspect of matrix multiplication covered in Chapter 9. 
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Multiplying two vectors is the same as creating a weighted sum of the values of the 
first vector using the values in the second vector as the weights. Therefore, if x; 
has five features and the vector w has five coefficients, the result of their multipli- 
cation is the sum of all five features, each one first multiplied by its respective 
coefficient. Matrix multiplication makes the procedure compact to express in for- 
mula, but in the end, the operation doesn’t differ from a weighted average. 


After getting the result of the multiplication of vectors, you sum the values to the 
bias and multiply everything to the value that you should have predicted (which is 
+1 for the first class and —1 for the second one). Because you’re working only with 
misclassified examples, the result of the operation is always negative because the 
multiplication of two sign-mismatched values is always negative. 


Finally, after running the same computation for all the misclassified examples, 
you pool all the results together and sum them. The result is a negative number 
that becomes positive because of the negative sign at the head of the formulation 
(which is like multiplying everything by —1). The size of the result increases as the 
number of perceptron errors becomes larger. 


By observing the results carefully, you realize that the formula is devised in a 
smart way (although far from being a miracle). The output is smaller when the 
number of errors is smaller. When you have no misclassification, the summation 
(result) turns to zero. By putting the formula in this form, you tell the computer 
to try to achieve perfect classification and never give up. The idea is that when it 
finds the right values of the vector w, and there aren’t any prediction errors, all 
that’s left to do is to apply the following formula: 


y= sign( Xw +q) 


Running the formula outputs a vector of predictions (y-hat) containing a sequence 
of +1 and -1 values that correspond to the expected classes. 


Touching the nonseparability limit 


The secret to perceptron calculations is in how the algorithm updates the vector w 
values. Such updates happen by randomly picking one of the misclassified exam- 
ples (call it x,) and changing the w vector using a simple weighted addition: 


w=w+N, *y,) 


The Greek letter eta (n) is the learning rate. It’s a floating number between 0 and 1. 
When you set this value near zero, it can limit the ability of the formula to update 
the vector w too much, whereas setting the value near one makes the update pro- 
cess fully impact the w vector values. Setting different learning rates can speed up 
or slow down the learning process. Many other algorithms use this strategy, and 
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FIGURE 12-1: 
The separating 
line of a 
perceptron 
across two 
classes. 


REMEMBER 


lower eta is used to improve the optimization process by reducing the number of 
sudden w value jumps after an update. The trade-off is that you have to wait lon- 
ger before getting the concluding results. 


The update strategy provides intuition about what happens when using a percep- 
tron to learn the classes. If you imagine the examples projected on a Cartesian 
plane, the perceptron is nothing more than a line trying to separate the positive 
class from the negative one. As you may recall from linear algebra, everything 
expressed in the form of y = xb+a is actually a line in a plane. 


Initially, when w is set to zero or to random values, the separating line is just one 
of the infinite possible lines found on a plane, as shown in Figure 12-1. The updat- 
ing phase defines it by forcing it to become nearer to the misclassified point. 
Using multiple iterations to define the errors places the line at the exact border 
between the two classes. 





In spite of being such a smart algorithm, perceptron showed its limits quite soon. 
Apart from being capable of guessing two classes using only quantitative features, 
it had an important limit: If two classes had no border due to mixing, the algo- 
rithm couldn’t find a solution and kept updating itself infinitively. 


If you can’t divide two classes spread on two or more dimensions by any line or 
plane, they’re nonlinearly separable. Overcoming data’s being nonlinearly separa- 
ble is one of the challenges that machine learning has to accomplish in order to 
become effective against complex problems based on real data, not just on artifi- 
cial data created for academic purposes. 


When the nonlinear separability matter came under scrutiny and practitioners 
started losing interest in the perceptron, experts quickly theorized that they could 
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fix the problem by creating a new feature space in which previously inseparable 
classes are tuned to become separable. Thus, the perceptron would be as fine as 
before. Unfortunately, creating new feature spaces is a challenge because it 
requires computational power that’s only partially available to the public today. 
Creating a new feature space is an advanced topic discussed later in the book when 
studying the learning strategies of algorithms, such as neural networks and sup- 
port vector machines. 


In recent years, the algorithm has had a revival thanks to big data: A perceptron, 
in fact, doesn’t need to work with all the data in memory, but it can do fine using 
single examples (updating its coefficient vector only when a misclassified case 
makes it necessary). It’s therefore a perfect algorithm for online learning, such as 
learning from big data an example at a time. 


Growing Greedy Classification Trees 
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Decision trees have a long history. The first algorithm of their kind dates back to 
the 1970s, but if you consider experiments and first original research, the use of 
decision trees goes back even earlier (they are as old as perceptions). As core algo- 
rithms of the symbolists tribe, decision trees have enjoyed a long popularity 
because of their intuitive algorithm. Their output is easily translated into rules 
and is therefore quite understandable by humans. They’re also extremely easy to 
use. All these characteristics make them an effective and appealing no-brainer 
with respect to models that require complex mathematical transformations of the 
input data matrix or extremely accurate tuning of their hyper-parameters. 


Predicting outcomes by splitting data 


Using a sample of observations as a starting point, the algorithm retraces the 
rules that generated the output classes (or the numeric values when working 
through a regression problem) by dividing the input matrix into smaller and 
smaller partitions until the process triggers a rule for stopping. Such retracing 
from particular toward general rules is typical of human inverse deduction, as 
treated by logic and philosophy. In a machine learning context, such inverse rea- 
soning is achieved by applying a search among all the possible ways to split the 
training in-sample and decide, in a greedy way, to use the split that maximizes 
statistical measurements on the resulting partitions. 


An algorithm is greedy when it always chooses its move to maximize the result in 
each step along the optimization process, regardless of what could happen in the 
following steps. In other words, the algorithm looks to maximize the current step 
without looking forward toward achieving a global optimization. 
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The division occurs to enforce a simple principle: Each partition of the initial data 
must make it easier to predict the target outcome, which is characterized by a dif- 
ferent and more favorable distribution of classes (or values) than the original 
sample. The algorithm creates partitions by splitting the data. It determines the 
data splits by first evaluating the features and then the values in the features that 
could bring the maximum improvement of a special statistical measure that plays 
the role of the cost function in a decision tree. 


A number of statistical measurements determine how to make the splits in a deci- 
sion tree. All abide by the idea that a split must improve on the original sample, or 
another possible split, when it makes prediction safer. Among the most used 
measurements are gini impurity, information gain, and variance reduction (for regres- 
sion problems). These measurements operate similarly, so this chapter focuses on 
information gain because it’s the most intuitive measurement and conveys how a 
decision tree can detect an increased predictive ability (or a reduced risk) in the 
easiest way for a certain split. Ross Quinlan created a decision tree algorithm 
based on information gain (ID3) in the 1970s, and it’s still quite popular thanks to 
its recent upgraded version to C4.5. Information gain relies on the formula for 
informative entropy, a generalized formulation that describes the expected value 
from the information contained in a message: 


Entropy = $, -p; log, p; 


In the formula, p is the probability for a class (expressed in the range of o to 1) and 
log, is the base 2 logarithm. Starting with a sample in which you want to classify 
two classes having the same probability (a 50/50 distribution), the maximum 
possible entropy is 


Entropy = -@.5*log,(@.5) -0.5xl0og,(0.5) = 1.0 


However, when the decision tree algorithm detects a feature that can split the 
dataset into two partitions, where the distribution of the two classes is 40/60, the 
average informative entropy diminishes: 


Entropy = -0.4xlog,(0.4) -@.6xlog,(@.6) = 0.97 


Note the entropy sum for all the classes. Using the 40/60 split, the sum is less 
than the theoretical maximum of 1 (diminishing the entropy). Think of the entropy 
as a measure of the mess in data: The less mess, the more order, and the easier it 
is to guess the right class. After a first split, the algorithm tries to split the obtained 
partitions further using the same logic of reducing entropy. It progressively splits 
any successive data partition until no more splits are possible because the sub- 
sample is a single example or because it has met a stopping rule. 
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Stopping rules are limits to the expansion of a tree. These rules work by considering 
three aspects of a partition: initial partition size, resulting partition size, and 
information gain achievable by the split. Stopping rules are important because 
rememper decision tree algorithms approximate a large number of functions; however, noise 
and data errors can easily influence this algorithm. Consequently, depending on 
the sample, the instability and variance of the resulting estimates affect decision 
tree predictions. 





As an example, take a look at what a decision tree can achieve using one of the 
original Ross Quinlan datasets that present and describe the ID3 algorithm in 
Induction of Decision Trees (1986) (http: //dl.acm.org/citation.cfm?id=637969). 
The dataset is quite simple, consisting of only 14 observations relative to the 
weather conditions, with results that say whether it’s appropriate to play tennis. 
The example contains four features: outlook, temperature, humidity, and wind, 
all expressed using qualitative classes instead of measurements (you could express 
temperature, humidity, and wind strength numerically) to convey a more intui- 
tive understanding of how the features relate to the outcome. The following 
example uses R to create a data. frame containing the play tennis data. 


weather <- expand.grid(Outlook = c("Sunny", "Overcast","Rain"), 
Temperature = c("Hot","Mild","Cool"), 
Humidity=c("High","Normal"), Wind=c("Weak", "Strong" ) ) 
response <- c(1, 19, 4, 31, 16, 2, 11, 23, 35, 6, 24, 15, 18, 36) 
play <- as.factor(c("No", "No", "No", "Yes", "Yes", "Yes", 
"Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No")) 
tennis <- data. frame(weather [response, |, play) 


To create a decision tree, the example uses the rpart library and sets the param- 
eters necessary to have a tree fully grown by information gain as the split 
criteria: 


library(rpart) 
tennis_tree <- rpart(play ~ ., data=tennis, method="class", 
parms=list(split="information"), 
control=rpart.control(minsplit=1 ) ) 


After creating the tree, you can inspect it using a simple print command or the 
summary command for a more detailed and verbose report about its construction. 
Different implementations can have different outputs, as you can see from the 
rpart output. 


In addition to rpart, you have other R implementation options for working with 
decision trees, such as the packages tree, party, and a few others that you can 
discover in this blog post: http: //www.r-bloggers .com/a-brief-tour-of-the 
onTHEWwes trees—and-forests/. Python also provides a Scikit-learn described at http: // 
scikit-learn.org/stable/modules/tree.html. However, if the tree isn’t too 
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FIGURE 12-2: 

A visualization of 
the decision tree 
built from the 
play tennis 
dataset. 


complex, a visual representation can immediately reveal how the tree works no 
matter what implementation you use. You can represent trees made by rpart using 
the rpart.plot package. Download the package from CRAN. The paper “rpart.plot: 
Plot rpart Models” by Stephen Milborrow at http: //www.milbo.org/rpart-— 
plot/ describes the package in detail (you can click the rpart plot link on the page 
to begin the download). After installing the package, you run it and plot the 
representation of the tree, as shown in Figure 12-2: 


library(rpart.plot) 
prp(tennis_tree, type=@, extra=1, under=TRUE, compress=TRUE ) 


[yes] Outlook = Snn,Ran 


Humidity = A 
a ia ‘ 


Outlook = T Wind = Str 


S A Wind = Str Outlook A d 
S$ è 6 > 


To read the nodes of the tree, just start from the topmost node, which corresponds 
to the original training data, and then start reading the rules. Note that each node 
has two derivations: The left branch means that the upper rule is true (stated as 
yes in a square box), and the right one means that it is false (stated as no in a 
square box). 


On the right of the first rule, you see an important terminal rule (a terminal leaf), 
in a circle, stating a positive result, Yes, that you can read as play tennis=True. 
According to this node, when the outlook isn’t sunny (Snn) or rainy (Ran), it’s 
possible to play. (The numbers under the terminal leaf show four examples 
affirming this rule and zero denying it.) Please note that you could understand the 
rule better if the output simply stated that when the outlook is overcast, play is 
possible. Frequently, decision tree rules aren’t immediately usable, and you need 
to interpret them before use. However, they are clearly intelligible (and much bet- 
ter than a coefficient vector of values). 


On the left, the tree proceeds with other rules related to Humidity. Again, on the 


left, when humidity is high and outlook is sunny, most terminal leaves are nega- 
tive, except when wind isn’t strong. When you explore the branches on the right, 
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TIP 


you see that the tree reveals that play is always possible when wind isn’t strong, 
or when the wind is strong but it doesn’t rain. 


Pruning overgrown trees 


Even though the play tennis dataset in the previous section illustrates the nuts 
and bolts of a decision tree, it has little probabilistic appeal because it proposes a 
set of deterministic actions (there are no conflicting instructions). Training with 
real data usually doesn’t feature such sharp rules, thereby providing room for 
ambiguity and hopeful likelihood. 


Another, more realistic example, is a dataset describing the survival rates of 
passengers from the RMS Titanic, the British passenger liner that sank in the North 
Atlantic ocean in April 1912 after colliding with an iceberg. Various versions of the 
dataset exist — the R version used in the example is made of cross tabulations of 
gender, age, and survival. The example transforms the tables into a matrix and 
learns rules using the rpart package as previously done with the play tennis 
dataset. 


data(Titanic, package = "datasets" ) 
dataset <- as.data. frame(Titanic) 
library(rpart) 


titanic_tree <- rpart(Survived ~ Class + Sex + Age, 
data=dataset, weights=Freq, method="class", 
parms=list(split="information"), 
control=rpart.control(minsplit=5) ) 

pruned_titanic_tree <- prune(titanic_tree, cp=0.02) 


Decision trees have more variance than bias in their estimations. To overfit the 
data less, the example specifies that the minimum split has to involve at least five 
examples; also, it prunes the tree. Pruning happens when the tree is fully grown. 
Starting from the leaves, the example prunes the tree of branches, showing little 
improvement in the reduction of information gain. By initially letting the tree 
expand, branches with little improvement are tolerated because they can unlock 
more interesting branches and leaves. Retracing from leaves to root and keeping 
only branches that have some predictive value reduces the variance of the model, 
making the resulting rules parsimonious. 


For a decision tree, pruning is just like brainstorming. First, the code generates 
all possible ramifications of the tree (as with ideas in a brainstorming session). 
Second, when the brainstorming concludes, the code keeps only what really works. 
A chart of the tree structure (see Figure 12-3) reveals that only two rules matter in 
survival: gender (being male penalizes survival) and not being in third class (the 
poorest). 
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FIGURE 12-3: 

A visualization of 
the pruned 
decision tree 
build from the 
Titanic dataset. 


library(rpart.plot) 
prp(pruned_titanic_tree, type=@, extra=1, under=TRUE, 
compress=TRUE ) 


Sex = Mal 


Class = 3rd 


No 
1364 367 


106 90 20 254 


Taking a Probabilistic Turn 


REMEMBER 


Naive Bayes, another basic learning algorithm, is more similar to the previously 
discussed perceptron than the decision tree because it is based on a set of values 
being put together to obtain a prediction. As with the perceptron and decision 
trees, Naive Bayes is a historical algorithm used since the 1950s, although under 
different names and slightly different forms. Moreover, Naive Bayes is famed for 
being an effective algorithm for learning from text, and it clearly appeals to the 
Bayesian tribe. Given its simplicity and the fact that it works with little prepro- 
cessing, it has become the baseline for most textual problems in machine learning 
before testing more complex solutions. 


Understanding Naive Bayes 


As with the perceptron, Naive Bayes requires values that are probabilities of an 
outcome given a certain context (a conditional probability). Moreover, you multiply 
the values instead of summing them. Naive Bayes is powered by computations on 
probability, and it consequently requires typical operations on probabilities. 


As seen in Chapter 9, when probabilities are multiplied, it means that the events, 
whose likelihood we are considering, are independent and not influencing each 
other in any way. Such an assumption, though deemed simplistic and naive, is 
frequent in many basic machine learning algorithms because it’s unbelievably 
effective when working with a lot of data. 
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By summing values or multiplying probabilities, you treat each piece of informa- 
tion as a separate contribution to the answer. It’s an unrealistic assumption 
sometimes because reality points to a world of interconnections. However, in spite 
of the lack of realism, Naive Bayes can outperform most complex techniques, as 
described by two researchers from Microsoft, Banko and Brill, in their memorable 
paper “Scaling to Very Very Large Corpora for Natural Language Disambiguation” 
(http: //ucrel.lancs.ac.uk/acl/P/PQ1/P01-10@5.pdf). 


Naive Bayes relates to the Bayes theorem discussed in the “Conditioning chance 
by Bayes’ theorem” section of Chapter 9. It represents a simplified form of the 
theorem itself. To answer the class of an example, the algorithm does the 
following: 

1. Learns the probabilities connecting the features to each of the possible classes. 

2, Multiplies all the probabilities related to each resulting class. 

3. Normalizes the probabilities by dividing each of them by their total sum. 

4. Takes the class featuring the highest probability as the answer. 

For instance, in the previous example of the play tennis dataset, you observe that 
the different distributions of the sunny, overcast, and rain outlooks connect to the 


positive and negative answer of whether to play tennis. Using R can provide you 
with a quick check on that observation: 


print (table(tennis$Outlook, tennis$play) ) 


No Yes 
Sunny 3 2 
Overcast @ 4 
Rain 2 3 


The output shows nine positive responses and five negative ones. By analyzing the 
positive responses, you see that given a positive response, the outlook is sunny 
two times out of nine (probability = 2/9 = 0.22); overcast four times out of nine 
(probability = 4/9 = 0.44); and rainy three times out of nine (probability = 3/9 = 
0.33). You can repeat the same procedure using the negative responses with prob- 
abilities of 3/5, 0/5, and 2/5 respectively for sunny, overcast, and rainy when you 
can’t play tennis. Using the Bayes theorem, you can determine that the probabili- 
ties that you calculated are actually P(E|B), which is the probability that given a 
certain belief (such as whether to play tennis), you have certain evidence (which 
is weather, in this case): 


P(BIE) = PCEB)ePCs) / PCE) 
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The formula provides the answer you need, because it is the probability of a cer- 
tain belief (to play or not to play) given certain evidence (the weather conditions). 
If you estimate the probabilities for every belief, you can choose the belief that has 
the highest probability, thus minimizing the risk of predicting something incor- 
rectly. P(E|B) then becomes critical for estimating the probabilities because P(B) 
is the general probability of a positive or negative answer (the prior probability) 
and it’s easy to determine. In this case, you have nine positive outcomes and five 
negative ones. Thus, P(B) is 9/(9+5) = 0.64 for positive and 0.36 for negative. 


When you have many pieces of evidence, as in this example, P(E|B) is a compound 
of all the single P(E|B) probabilities on hand. This example has sets of probabilities 
for outlook, temperature, humidity, and wind. Putting them together isn’t easy 
unless you presume that they affect the response separately. As mentioned previ- 
ously, probabilities of independent events are simply multiplied together, and the 
overall P(E|B) turns to be the multiplication of all the P(E|B) for each feature. 


It may happen that you don’t have evidence for a response. For instance, in this 
example, you don’t have cases of not playing tennis when the sky is overcast. The 
result is a zero probability, and in a multiplication, a zero probability always 
returns zero no matter what other probabilities are involved. A lack of evidence for 
a response can occur when you don’t sample enough examples. A good practice is 
to modify observed probabilities by a constant, called a Laplace correction, which 
consists of adding fictitious evidence when estimating the probability. Using such 
a correction in this example, the probability of 0/5 would become (0+1)/(5+1) = 0.17. 


P(E) isn’t a big deal for this example, and you should ignore it. The reason P(E) 
doesn’t matter is that it represents the probability of seeing a certain set of fea- 
tures in reality and it naturally varies from example to example (for instance, your 
location could make certain weather conditions rare). However, you’re not com- 
paring probabilities across examples. You’re comparing probabilities within each 
example to determine the most likely prediction for that example. Within the 
same example, the probability of a certain set of evidence is the same because you 
have only that set for every possible outcome. Whether the set of evidence is rare 
doesn’t matter; in the end, you have to predict that example in isolation from the 
others, so you can safely rule out P(E) by making it a value of 1. The following 
example shows that how to use R determines the numbers to plug in the formula- 
tion for getting a prediction given certain weather conditions: 


outcomes <- table(tennis$play) 

prob_outcomes <- outcomes / sum(outcomes ) 

outlook <- t(as.matrix(table(tennis$Outlook, tennis$play) )) 
/ as.vector (outcomes) 

temperature <- t(as.matrix(table(tennis$Temperature, 
tennis$play))) / as.vector (outcomes) 
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humidity <- t(as.matrix(table(tennis$Humidity, 
tennis$play))) / as.vector (outcomes) 

wind <- t(as.matrix(table(tennis$Wind, tennis$play))) / 
(as. vector (outcomes) ) 


After running the previous code snippet, you have all the elements needed to fig- 
ure out a prediction. Pretend that you to need to guess the following condition: 


Outlook = Sunny, Temperature = Mild, Humidity = Normal, 
Wind = Weak 


To obtain the required information, you first compute the probability for a posi- 
tive outcome: 


p_positive <- outlook["Yes", "Sunny" ] 
temperature["Yes", "Mild" ] 
humidity["Yes","Normal"] * wind["Yes","Weak"] x 
prob_outcomes["Yes" ] 


* 
* 


If you print p_positive, you can see that the probability is 0.02821869. Now you 
can check for a negative outcome: 


p_negative <- outlook["No","Sunny"] x 
temperature["No","Mild"] * 
humidity["No","Normal"] * wind["No", "Weak"] x 
prob_outcomes["No"] 


The result in terms of probability for a negative response is 0.006857143. Finally, 
you can obtain the guess using a Boolean check: 


print (p_positive >= p_negative) 


The positive result, TRUE, confirms that, given such conditions, the algorithm 
predicts that you can play tennis. 


Estimating response with Naive Bayes 


Now that you know how it works, Naive Bayes should appear quite simple and 
strongly naive in its assumption. You should also know that multiplying probabil- 
ities is fine. You also need to consider these issues: 


>» Fixing zero probabilities using the Laplace correction 


>> Converting numeric features into qualitative variables because estimating 
probability for classes comprising ranges of numbers is easier 
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>» Using counted features only (values equal to or above zero) — although some 
algorithm variants can deal with binary features and negative values 


>> Imputing values in missing features (when you're missing an important 
probability in the computation), along with removing redundant and irrelevant 
features (keeping such features would make estimation by Naive Bayes more 
difficult) 


In particular, irrelevant features can affect results a lot. When you are working with 
a few examples with many features, an unlucky probability can really skew your 
result. As a solution, you can select features, filtering only the most important ones. 
Chapter 15 discusses this technique using linear models (the presence of useless 
pieces of information also affects linear models). When you have enough examples 
and you spend some time fixing features, Naive Bayes renders effective solutions to 
many prediction problems involving the analysis of textual input such as: 


>> Email spam detection: Allows you to place only useful information in 
your Inbox. 


>> Text classification: No matter the source (online news, tweets, or other 
textual feeds), you can correctly arrange text into the right category (such as 
sports, politics, foreign affairs, and economy). 


>> Text-processing tasks: Lets you perform spelling correction or guess the 
language of a text. 


>» Sentiment analysis: Detects the sentiment behind written text (positive, 
negative, neutral, or one of the basic human emotions). 


As an example of a practical application, you can use R and the klaR library, which 
contains the NaiveBayes function. This library offers an interesting dataset con- 
taining selected features for detecting whether an inbound email should be con- 
sidered spam. Because klarR and kernlab are nonstandard libraries, the first time 
you run the code, you have to install them: 


install.packages(c("klarR", "kernlab") ) 
After doing so, you are ready to run the example code. 


library(klaR) 
data(spam, package = "kernlab") 


Hewlett-Packard Labs collected the dataset and classified 4,601 emails as spam or 
nonspam, using 57 features. You can also find the spam dataset on the free UCI 
machine learning repository at https://archive.ics.uci.edu/ml/datasets/ 
Spambase. 
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If you upload the spam dataset and check its features (using the command 
head(spam), for instance), you notice that some features are words, whereas oth- 
ers point to the presence of certain characters or writing styles (such as capital 
letters). More noticeably, some of the features aren’t integer numbers represent- 
ing counts but rather are floats ranging from o to 100. They represent the pres- 
ence of each feature in the text as a percentage (for instance, the variable 
charDollar represents the percentage of dollar sign characters in the phrase and 
ranges from o to 6). 


Features expressed as percentages of certain words or characters in text represent 
a wise strategy to balance the higher likelihood of finding certain elements if the 
text is long. Using percentages instead of counts normalizes the texts and lets you 
view them as being of the same length. 


Applying a Naïve Bayes model in R requires just a few commands. You can set the 
Laplace correction using the fL parameter (the example keeps it set to zero) and 
defining different a priori probabilities, P(B), using the prior parameter and pro- 
viding a vector of probabilities. In this example, we set the probability of an 
email’s being nonspam to 90 percent: 


set .seed(1234) 

train_idx <- sample(1:nrow(spam), ceiling(nrow(spam)*3/4), 
replace=FALSE) 

naive <- NaiveBayes(type ~ ., data=spam[train_idx,], 
prior = c(@.9,@.1), fl = @) 


The code doesn’t use all the examples at hand, but instead keeps a quarter of them 
for testing the results out-of-sample. NaiveBayes automatically translates the 
numeric features into features suitable for the algorithm, thus leaving little more 
to do than to ask the model to scan the out-of-sample data and generate predic- 
tions that you check in order to know how well the model works. (Don’t worry 
about the warnings you see from confusionMatrix function found in the caret 
library — what you care about is the confusion matrix output.) 


This book uses the caret library for error estimation. The caret library is a power- 
ful library supporting you in many operations for validating and evaluating 
machine learning algorithm, but you must install it first: 


install.packages("caret" ) 


Installation will take some time because the library has many dependencies (it 
requires many other R libraries such as the e1071 library, a package for supporting 
vector machines). After completing the installation, you can proceed with the 
example. 
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library(caret ) 

predictions <- predict(naive, spam[-train_idx, ] ) 

confusionMatrix(predictions$class, 
spam[-train_idx,"type"] ) 


The example shows that NaiveBayes takes longer to predict than to train. The 
increased prediction time happens because training the algorithm involves count- 
ing only the occurrences in the features and stores the results. The real computa- 
tions happen when the algorithm is predicting, making it fast to train and slow to 
predict. Here’s the output you see from this example: 


Prediction nonspam spam 
nonspam 403 24 
spam 293 430 


It seems that, as often happens, an algorithm can catch almost all the spam, but 
does so at the cost of putting some regular email into the spam box. The code can 
report such problems using a score measure such as accuracy, which you can split 
further into accuracy for the positive and the negative classes respectively: 


Accuracy : 9.7243 
Pos Pred Value : 0.9438 
Neg Pred Value : 90.5947 


Catching spam isn’t that difficult; the problem is to avoid discarding important 
emails in the process (false positives, where a positive is a spam email). You see 
this example again later in the book, where you use a more sophisticated approach 
to understand how you can improve the result. 


There are quite a few kinds of Naive Bayes models. The model you just used is 
called multinomial. However, there is also a Bernoulli version, suitable for binary 
indicators (you don’t count words but instead check to determine whether they’re 
present) and a Gaussian version (which expects normally distributed features — 
having both positive and negative values). Python, contrary to R, offers a complete 
range of Naive Bayes models in the Scikit-learn package at http: //scikit-— 
learn.org/stable/modules/naive_bayes.html. 
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Learning from 
Smart and Big 
Data 





IN THIS PART... 

Getting data ready for use 
“Understanding data through similarity 
Workingwithlinestmodels O 
“Workingwith neural networks 
“Employing smart vector machines 


Performing multiple levels of analysis using ensembles 


IN THIS CHAPTER 


Deciding to get more data or start 
preparations 


Fixing bad data like wrong or missing 
values 


Creating new meaningful features 


Compressing and reconstructing any 
redundant information 


Understanding why you should 
beware of outliers 


Preprocessing Data 


hen building a new house, before thinking of any beautiful architecture, 

aesthetic addition, or even furniture designed to beautify it, you need to 

build a solid foundation over which to construct walls. In addition, the 
more difficult the terrain you have to work on, the more time and effort it will 
take. If you neglect to create a sturdy foundation, nothing built on it can withstand 
time and nature for long. 


The same issue exists in machine learning. No matter the level of sophistication 
of the learning algorithm, if you don’t prepare your foundation well — that is, 
your data — your algorithm won’t last long when tested in real data situations. 
You can’t prepare data by just looking at it; you must expend the effort to examine 
it closely. Unfortunately, time spent on cleaning data can take around 80 percent 
of the total time you devote to a machine learning project. 


Preparing data consists of several steps (as detailed in the sections that follow): 


Obtain meaningful data (also called ground truth), which is data that someone 
has correctly measured or labeled. 


Acquire enough data for the learner algorithm to work correctly. You can't tell 
in advance how much data you'll need because it all depends on the algorithm 
you plan to use. Only after testing will you be able to figure out whether 
your estimates are suffering from high bias or variance of the algorithm you 
decided to use. 
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3. Arrange the gathered data into a matrix. 


4. Deal with bad data, such as missing cases (a problem that occurs often), 
distorted distributions, redundancies, and anomalous examples. 


5. Create new features (as needed) that are better suited for your algorithm to 
learn how to map the response. 


Gathering and Cleaning Data 
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Although this book focuses on how machines learn from data, no magic recipe 
exists in the world of algorithms (as the “no free lunch” theorem in Chapter 11 
states) when it comes to data. Even sophisticated and advanced learning functions 
hit the wall and underperform when you don’t support them with the following: 


>» Large enough quantities of data that are suitable for the algorithm you use 


>> Clean, well-prepared data suitable for use in machine learning 


Chapter 11 discusses the role of data quantity in learning when it explains bias and 
variance trade-offs. As a reminder, large quantities of data can prove beneficial to 
learning purposes when the variability of the estimates is a problem, because the 
specific data used for learning heavily influences predictions (the overfitting 
problem). More data can really help because a larger number of examples aids 
machine learning algorithms to disambiguate the role of each signal picked up 
from data and taken into modeling the prediction. 


Besides data quantity, the need for data cleanliness is understandable — it’s just 
like the quality of teaching you get at school. If your teachers teach you only non- 
sense, make erroneous examples, spend time joking, and in other ways don’t take 
teaching seriously, you won’t do well on your examinations no matter how smart 
you are. The same is true for both simple and complex algorithms — if you feed 
them garbage data, they just produce nonsense predictions. 


According to the principle of garbage in, garbage out (GIGO for short), bad data 
can truly harm machine learning. Bad data consists of missing data, outliers, 
skewed value distributions, redundancy of information, and features not well 
explicated. This chapter frames all these problems and shows how to solve them. 


Bad data may not be bad in the sense that it’s wrong. Quite often, bad data is just 
data that doesn’t comply with the standards you set for your data: a label written 
in many different ways; erratic values spilled over from other data fields; dates 
written in invalid formats; and unstructured text that you should have structured 
into a categorical variable. Enforcing rules of data validity in your databases and 


PART 4 Learning from Smart and Big Data 


working on the design of better data tables as well as the exactness of the process 
that stores data can prove of invaluable help for machine learning and let you 
concentrate on solving trickier data problems. 


Repairing Missing Data 


TIP 


Even if you have enough examples at hand for training both simple and complex 
learning algorithms, they must present complete values in the features, without 
any missing data. Having an incomplete example makes connecting all the signals 
within and between features impossible. Missing values also make it difficult for 
the algorithm to learn during training. You must do something about the missing 
data. Most often, you can ignore missing values or repair them by guessing a 
likely replacement value. However, too many missing values render more uncer- 
tain predictions because missing information could conceal any possible figure; 
consequently, the more missing values in the features, the more variable and 
imprecise the predictions. 


Identifying missing data 


As a first step, count the number of missing cases in each variable. When a vari- 
able has too many missing cases, you may need to drop it from the training and 
test dataset. A good rule of thumb is to drop a variable if more than 90 percent of 
its instances are missing. 


Some learning algorithms do not know how to deal with missing values and report 
errors in both training and test phases, whereas other models treat them as zero 
values, causing an underestimation of the predicted value or probability (it’s just 
as if part of the formula isn’t working properly). Consequently, you need to replace 
all the missing values in your data matrix with some suitable value for machine 
learning to happen correctly. 


Many reasons exist for missing data, but the essential point is whether the data 
is missing randomly or in a specific order. Random missing data is ideal because 
you can guess its value using a simple average, a median, or another machine 
learning algorithm, without too many concerns. Some cases contain a strong bias 
toward certain kinds of examples. For instance, think of the case of studying the 
income of a population. Wealthy people (for taxation reasons, presumably) tend 
to hide their true income by reporting to you that they don’t know. Poor people, 
on the other hand, may say that they don’t want to report their income for fear 
of negative judgment. If you miss information from certain strata of the popula- 
tion, repairing the missing data can be difficult and misleading because you may 
think that such cases are just like the others. Instead, they are quite different. 
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Therefore, you can’t simply use average values to replace the missing values — you 
must use complex approaches and tune them carefully. Moreover, identifying cases 
that aren’t missing data at random is difficult because it requires a closer inspec- 
tion of how missing values are associated with other variables in the dataset. 


When data is missing at random, you can easily repair the empty values because 
you obtain hints to their true value from other variables. When data isn’t missing 
at random, you can’t get good hints from other available information unless you 
understand the data association with the missing case. Therefore, if you have to 
figure out missing income in your data, and it is missing because the person is 
wealthy, you can’t replace the missing value with a simple average because you'll 
replace it with a medium income. Instead, you should use an average of the income 
of wealthy people as a replacement. 


When data isn’t missing at random, the fact that the value is missing is informa- 
tive because it helps track down the missing group. You can leave the chore of 
looking for the reason that it’s missing to your machine learning algorithm by 
building a new binary feature that reports when the value of a variable is missing. 
Consequently, the machine learning algorithm will figure out the best value to use 
as a replacement by itself. 


Choosing the right replacement strategy 


You have a few possible strategies to handle missing data effectively. Your strat- 
egy may change if you have to handle missing values in quantitative (values 
expressed as numbers) or qualitative features. Qualitative features, although also 
expressed by numbers, are in reality referring to concepts, so their values are 
somewhat arbitrary and you cannot meaningfully take an average or other com- 
putations on them. 


When working with qualitative features, your value guessing should always pro- 
duce integer numbers, based on the numbers used as codes. Common strategies 
for missing data handling are as follows: 


>> Replace missing values with a computed constant such as the mean or 
the median value. If your feature is a category, you must provide a specific 
value because the numbering is arbitrary, and using mean or median doesn't 
make sense. Use this strategy when the missing values are random. 


>> Replace missing values with a value outside the normal value range of 
the feature. For instance, if the feature is positive, replace missing values 
with negative values. This approach works fine with decision tree-based 
algorithms (such as those explained in the previous chapter) and qualitative 
variables. 


PART 4 Learning from Smart and Big Data 


>> Replace missing values with 0, which works well with regression models 
and standardized variables (a topic found in Chapter 15). This approach is 
also applicable to qualitative variables when they contain binary values. 


>> Interpolate the missing values when they are part of a series of values 
tied to time. This approach works only for quantitative values. For instance, if 
your feature is daily sales, you could use a moving average of the last seven 
days or pick the value at the same time the previous week. 


>> Impute their value using the information from other predictor features 
(but never use the response variable). Particularly in R, there are specialized 
libraries like missForest (https: //cran.r—-project.org/web/packages/ 
missForest/index.htm1), MICE (https: //cran.r—project.org/web/ 
packages/mice/index.html1), and Amelia Il (http: //gking. harvard.edu/ 
amelia) that can do everything for you. 


Another good practice is to create a new binary feature for each variable whose 
values you repaired. The binary variable will track variations due to replacement 
or imputing with a positive value, and your machine learning algorithm can figure 
out when it must make additional adjustments to the values you actually used. 


In Python, missing values are made possible only using the ndarray data struc- 
ture from the NumPy package. Python marks missing values with a special value 
that appears printed on the screen as NaN (Not a Number). The DataFrame data 
structure from the pandas package offers methods for both replacing missing val- 
ues and dropping variables. 


The following Python example demonstrates how to perform replacement tasks. It 
begins by creating a dataset of 5 observations and 3 features, named “A”, “B”, “C”: 


import pandas as pd 

import numpy as np 

data = pd.DataFrame([[1,2,np.nan],[np.nan,2,np.nan], 
[3,np.nan,np.nan],[np.nan,3,8], 
[5,3,np.nan]],columns=['A','B', 'C']) 

print(data,'\n') # prints the data 

# counts NaN values for each feature 

print(data.isnull().sum(axis=@) ) 


A B C 
Q@ 14 2 NaN 
1 NaN 2 NaN 
2 3 NaN NaN 
3 NaN e & 
4 5 3 NaN 
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dtype: int64 


Because feature C has just one value, you can drop it from the dataset. The code 
then replaces the missing values in feature B with a medium value and interpo- 
lates the value in feature A because it displays a progressive order. 


# Drops definitely C from the dataset 

data.drop('C', axis=1, inplace=True) 

# Creates a placeholder for B's missing values 
data['missing_B'] = data['B'].isnull().astype(int) 
# Fills missing items in B using B's average 
data['B'].fillna(data['B'].mean(), inplace=True) 

# Interpolates A 
data['A'].interpolate(method='linear', inplace=True) 
print(data) 


A B missing_B 
© 1 2.0 Q 
qd A 8) (2) 
203929 1 
3 4 3.0 (2) 
4 5 3.0 (2) 


The printed output is the final dataset. Be sure to note that the mean of B isn’t an 
integer value, so the code converted all B values to floating numbers. This approach 
makes sense if B is numeric. If it were a category, and the numbering were mark- 
ing a class, the code should have filled the feature using the command data['B']. 
fillna(data['B'].mode().iloc[@], inplace=True), which uses the mode, 
that is, the first most frequent value in the series. 


In R, missing values appear marked as NA when printed or summarized. Both 
languages provide special ways to locate and deal with empty values. After you’ve 
located them, you have to decide whether to replace or remove them. To repli- 
cate the Python example in R, you need to install the zoo package for your plat- 
form (https: //cran.r-project .org/web/packages/zoo/index.htm1) to create 
interpolations: 


install.packages(pkgs='z0o', dependencies=TRUE ) 
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After installing the zoo package, you can create a data frame and replace the miss- 
ing values using the same strategy as before: 


library(zoo) 

df <- data. frame(A=c(1,NA,3,NA,5), 
B=c(2,2,NA,3,3), 
C=c(NA,NA,NA,8,NA)) 


print(df) 
Nh B IG 
1 1 2NA 
2 NA 2 NA 
3 3 NA NA 
4NA 3 8 
5 5 3 NA 


df <- subset(df, select = c('A','B')) 
df['m_B'] <- as.numeric(is.na(df$B) ) 
df$Blis.na(df$B)] <- mean(df$B, na.rm=TRUE ) 
df$A <- na.approx(df$A) 

print(df) 


or ONA 
ork ONE D 
WOWNDN N 
eCoueonew 
© O e OOU 


As shown in the example, sometimes you can’t do much with examples that have a 
lot of missing values in their features. In such cases, if the example is for training 
(test examples should not be removed, instead), remove it from the set (a procedure 
TIP called listwise deletion) so that the incomplete cases won’t affect learning. If 
instead, the example is part of your test, you shouldn’t remove it and use it to get an 
evaluation of how well your machine learning algorithm handles such situations. 


Transforming Distributions 


Although statistics are based on the expectation that features have certain value 
distributions, machine learning generally doesn’t have such constraints. A 
machine learning algorithm doesn’t need to know beforehand the type of data 
distribution it will work on, but learns it directly from the data used for training. 
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In probability, a distribution is a table of values or a mathematical function that 
links every possible value of a variable to the probability that such value could occur. 
Probability distributions are usually (but not solely) represented in charts whose 
abscissa axis represents the possible values of the variable and whose ordinal axis 
represents the probability of occurrence. Most statistical models rely on a normal 
distribution, a distribution that is symmetric and has a characteristic bell shape. 


Even when using learning algorithms lent from statistics, you never need to 
transform an actual distribution to resemble a normal distribution or any other 
notable statistical distribution (such as uniform or Poisson distributions). Machine 
learning algorithms are usually smart enough to find out how to deal with any 
distribution present in the features by themselves. However, even if transforming 
the actual distributions isn’t necessary for a machine learning algorithm to work 
properly, it can still be beneficial for these reasons: 


>> To make the cost function minimize better the error of the predictions 


>> To make the algorithm converge properly and faster 


You can apply transformations to the response variable to minimize the weight of 
any extreme case. When you predict values (as in a regression problem), and some 
of the values are too extreme with respect to the majority of values, you may apply 
transformations that tend to shorten the distance between values. Consequently, 
when striving to minimize the error, the algorithm won’t focus too much on the 
extreme errors and will obtain a general solution. In such cases, you normally 
choose the logarithmic transformation, but this transformation requires positive 
values. If you work with just positive numbers and the problem is the zero, add 1 
to your values so that none of them will be zero. (The result of log(1) is actually 
zero, making it a convenient new starting point.) 


If you can’t use a logarithmic transformation, apply a cubic root, which preserves 
the sign. In certain cases (when the extreme values are huge), you may want to apply 
the inverse transformation instead (that is, divide 1 by the value of your response). 


Some machine learning algorithms based on the gradient descent or on distance 
measures (like the algorithms described in Chapter 14, such as K-means and 
K-Nearest Neighbors) are quite sensitive to the scale of the numeric values pro- 
vided. Consequently, in order for the algorithm to converge faster or to provide a 
more exact solution, rescaling the distribution is necessary. Rescaling mutates the 
range of the values of the features and can affect variance, too. You can perform 
features rescaling in two ways: 


» Using statistical standardization (z-score normalization): Center the mean 
to zero (by subtracting the mean) and then divide the result by the standard 
deviation. After such transformation, you find most of the values in the range 
from -3 to +3. 
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>> Using the min-max transformation (or normalization): Remove the 
minimum value of the feature and then divide by the range (maximum value 
minus minimum value). This act rescales all the values from 0 to 1. It’s 
preferred to standardization when the original standard deviation is too small 
(original values are too near, as if they all stick together around the mean) or 
when you want to preserve the zero values in a sparse matrix (a concept 
discussed in Chapter 9, which explores the math of machine learning). 


Both R and Python provide functions for distribution transformation. Apart from 
the mathematical functions log and exp, which are immediately available in R and 
require the import of the NumPy package in Python, standardization and min- 
max normalization require a little more work. 


In R, you achieve standardization using the scale function (try help(scale) on R), 
but min-max normalization requires you to define your own function: 


min_max <- function(x) 
{return ((x - min(x, na.rm=TRUE)) / 
(max(x, na.rm=TRUE) - min(x, na.rm=TRUE)))} 


In Python, you use functions and classes from Scikit-learn module preprocessing, 
such as: sklearn.preprocessing.scale, sklearn.preprocessing.Standard-— 
Scaler, and sklearn.preprocessing.MinMaxScaler (you can find more on the 
subject at http: //scikit—learn.org/stable/modules/preprocessing.htm1). 


Part 5 of the book provides some examples of standardization and normalization. 
These examples show how to apply machine learning to real data problems and 
thus deal with data preprocessing as well. 


Creating Your Own Features 


Sometimes the raw data you obtain from various sources won’t have the features 
needed to perform machine learning tasks. When this happens, you must create 
your own features in order to obtain the desired result. Creating a feature doesn’t 
mean creating data from thin air, as you find in the sections that follow. You cre- 
ate new features from existing data. 


Understanding the need to create features 


One great limitation of machine learning algorithms is that it can be impossible to 
guess a formula that could link your response to the features you’re using. Some- 
times this inability to guess happens because you can’t map the response using 
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the information you have available (meaning that you don’t have the right infor- 
mation). In other cases, the information you provided doesn’t help the algorithm 
learn properly. For instance, if you’re modeling the price of real estate properties, 
the surface of the land is quite predictive because larger properties tend to cost 
more. But if instead of the surface, you provide your machine learning algorithm 
with the length of the sides of the land (the latitude and longitude coordinates of 
its corners), your algorithm may not figure out what to do with the information 
you provided. Some algorithms will manage to find the relationship between the 
features, but most algorithms won’t. 


The answer to this problem is feature creation. Feature creation is that part of 
machine learning that is considered more an art than a science because it implies 
human intervention in creatively mixing the existing features. You perform this 
task by means of addition, subtraction, multiplication, and ratio to generate new 
derived features with more predictive power than the originals. 


Knowing the problem well and figuring out how a human being would solve it is 
part of feature creation. So, connecting to the previous example, the fact that land 
surface connects to the property price is common knowledge. If surface is missing 
from your features when trying to guess the value of a property, you can recover 
such information from the existing data — and doing so increases the perfor- 
mance of the predictions. Regardless of whether you rely on common sense, com- 
mon knowledge, or specialized expertise, you can do a lot for your machine 
algorithm if you first figure out what information should work the best for the 
problem and then try to have it available or derive it from among your features. 


Creating features automatically 


You can create some new features automatically. One way to achieve automatic 
feature creation is to use polynomial expansion. Specific ways are available to 
achieve polynomial expansion so that you create features automatically in both R 
and Python. You see some detailed examples when working with regression mod- 
els and support vector machines later in the book. For the time being, you need to 
grasp the concepts behind polynomial expansion. 


In polynomial expansion, you automatically create interactions between features 
as well as create powers (for instance, computing the square of a feature). Interac- 
tions rely on multiplication of the features. Creating a new feature using multipli- 
cation helps to keep track of how features tend to behave as a whole. Therefore, it 
helps to map complex relationships between your features that can hint at special 
situations. 


A great example of an interaction is the noise emitted from a car and the price of 
the car. Consumers don’t appreciate noisy cars unless they buy a sports car, in 
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which case the engine noise is a plus that reminds the owner of the car’s power. It 
also makes bystanders notice the cool car, so noise plays a great role in showing 
off because noise will certainly attract others’ attention. On the other hand, noise 
when driving a family car is not all that cool. 


In a machine learning application, in trying to predict the rate of preference for a 
certain car, features such as noise and the price of the car are predictive by them- 
selves. However, multiplying the two values and adding them to the set of features 
can unequivocally hint to a learning algorithm that the target is a sports car (when 
you multiply high noise levels by a high price). 


Powers help by creating nonlinear relations between the response and the fea- 
tures, hinting at specific situations. As another example, imagine that you have to 
predict a person’s yearly expenses. Age is a good predictor because as people grow 
old and mature, their life and family situation change, too. Students start out poor 
but then find work and can build a family. From a general point of view, expenses 
tend to grow as does age until a certain point. Retirement usually marks a point at 
which expenses tend to diminish. Age contains such information, but it’s a feature 
that tends to grow, and relating expenses to its growth doesn’t help to describe 
the inversion that occurs at a certain age. Adding the squared feature helps create 
a counter effect to age itself, which is small at the beginning but grows quickly 
with age. The final effect is a parabola, with an initial growth characterized by a 
peak in expenses at a certain age, and then a decrease. 


As mentioned initially, knowing in advance such dynamics (noise and sports car, 
consumption and elder age) can help you create the right features. But if you don’t 
know these dynamics in advance, polynomial expansion will automatically create 
them for you because, given a certain order, it will create interactions and powers 
of that order. The order will point out the number of multiplications and the max- 
imum power to apply to the existing features. So a polynomial expansion of order 
2 raises all the features to the second power and multiplies every single feature by 
all the others. (You get the multiplication of all the combinations of two features.) 
Clearly, the higher the number, the more new features will be created, but many 
of them will be redundant and just contribute to making your machine learning 
algorithm overfit the data. 


When using polynomial expansion, you have to pay attention to the explosion of 
features you are creating. Powers increase linearly, so if you have five features and 
you need an expansion of order 2, each feature is raised by up to the second power. 
Increasing the order of one just adds a new power feature for each original feature. 
Instead, interactions increase based on combinations of the features up to that 
order. In fact, with five features and a polynomial expansion of order 2, all ten 
unique combinations of the coupling of the features are created. Increasing the 
order to 3 will require the creation of all the unique combinations of two variables, 
plus the unique combinations of three variables, that is, 20 features. 
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Ideally, in machine learning you can get the best results when your features don’t 
completely correlate with each other and each one has some predictive power with 
respect to the response you’re modeling. In reality, your features often do corre- 
late with each other, displaying a high degree of redundancy in the information 
available to the dataset. 


Having redundant data means that the same information is spread across multiple 
features. If it’s exactly the same information, it represents a perfect collinearity. 
If, instead, it’s not exactly the same information but varies in some way, you have 
collinearity between two variables or multicollinearity between more than two 
variables. 


Redundant data is a problem that statistical theory created solutions to address 
long ago (because statistical computations can suffer a lot from multicollinearity). 
This chapter presents the topic under a statistical point of view, illustrating using 
the concepts of variance, covariance, and correlation. You can imagine each feature 
as bearing different informative components, mixed in different proportions: 


>> Unique variance: The redundancy is unique to a particular feature, and when 
correlated or associated with the response, it can add a direct contribution in 
the prediction of the response itself. 


>> Shared variance: The redundancy is common with other features because of 
a Causality relationship between them (as discussed in Chapter 9). In this case, 
if the shared information is relevant to the response, the learning algorithm 
will have a difficult time choosing which feature to pick up. And when a 
feature is picked up for its shared variance, it also brings along its specific 
random noise. 


>> Random noise component: Information due to measurement problems or 
randomness that isn’t helpful in mapping the response but that sometimes, 
by mere chance (yes, luck or misfortune is part of being random), can appear 
related to the response itself. 


Unique variance, shared variance, and random noise fuse together and can’t be 
separated easily. Using feature selection, you reduce the impact of noise by select- 
ing the minimum set of features that work best with your machine learning algo- 
rithm. Another possible approach is based on the idea that you can fuse that 
redundant information together using a weighted average, thus creating a new 
feature whose main component is the shared variance of multiple features, and its 
noise is an average of previous noise and unique variance. 
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For instance, if A, B, and C share the same variance, by employing compression you 
can obtain a component (so it is called a new feature) made up of the weighted 
summation of the three features such as 0.5*A+0.3*B+0.2*C. You decide the weights 
on the basis of a particular technique called singular value decomposition (SVD). 


SVD has various applications, not just in compressing data but also in finding 
latent factors (hidden features in our data) and in recommender systems, which 
are systems for discovering what someone might like in terms of products or films 
based on previous selections. Chapter 21 presents a detailed explanation of the 
nuts and bolts of SVD for recommender systems. For compression purposes, the 
chapter instead illustrates a technique called principal components analysis (PCA), 
which uses parts of the SVD outputs. 


PCA works simply and straightforwardly: It takes as an input a dataset and returns 
a new, reconstructed dataset of the same shape. In this new dataset, all the fea- 
tures, called components, are uncorrelated, and the most informative components 
appear at the beginning of the dataset. 


PCA also offers a report of how each component equates to the initial dataset. By 
summing the informative value of the new components, you may find that a few 
components express 90 percent or even 95 percent of the original information. 
Taking just those few components is equivalent to using the original data, thus 
achieving a compression of your data by removing redundancies and reducing the 
number of features. 


As an example, the following example refers to the Boston dataset and uses 
Python’s Scikit implementation of PCA. R has many equivalent functions, the 
most popular being princomp, which you can learn about by using the 
help(princomp) command to obtain more information and some examples of its 
usage. Here is the Python code snippet for testing the effectiveness of a PCA: 


from sklearn.datasets import load_boston 
from sklearn.decomposition import PCA 
from sklearn.preprocessing import scale 
import numpy as np 

boston = load_boston() 

X, y = boston.data, boston.target 

pea = PCA(). fit(X) 


After calculating the PCA, the example proceeds to print the informative power of 
this new, reconstructed dataset: 


print (' '.join(['%5i'%(k+1) for k in range(13)])) 


print (' '.join(['----- "]*«13)) 
print (' '.join(["%@.3f" % (variance) for variance 
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in pca.explained_variance_ratio_] )) 
print (' '.join(["%@.3f" % (variance) for variance 
in np.cumsum(pca.explained_variance_ratio_)])) 


1 2 3 4 5 6 T 8 om 





0.806 0.163 0.021 0.007 0.001 0.001 0.000 0.000 0.000 ... 
0.806 0.969 0.990 0.997 0.998 0.999 1.009 1.000 1.000 ... 


In the printed report, the thirteen components (only nine of which appear in the 
book) account for a cumulative dataset that exceeds 85 percent of the original 
when taking into account six components out of 13 and 95 percent with nine com- 
ponents. Using a reconstructed dataset with fewer components than the number 
of the original features often proves beneficial to the machine learning processes 
by reducing memory usage and computation time and by containing the variance 
of the estimates, thus assuring the stability of the results. 
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Sometimes while exploring your data, you may stumble upon values that differ 
from the values you expected in the sense that they are too high or too low, or 
simply that they’re unusual and highly improbable. These are outliers, also called 
anomalies, when you’re certain that they are so because something went wrong 
with them. Instead, if they are just unusual because it’s the first time that you’ve 
seen such values, but they are legitimate information, they’re novelties, a new kind 
of example. 


In some cases, an example is an outlier because it possesses a unique and unusual 
combination of values. Sometimes it’s an association with an impossible or highly 
unlikely value (for example, you find an infant who has a university degree in a 
dataset). Sometimes a subtle mix of many values makes the example rare (so rare 
that it can’t be true). 


Outliers are a problem when learning from data. Because machines learn by 
observing examples and extracting rules and patterns from them, a weird case can 
prove difficult to understand and force the algorithm to reconsider what it has 
learned so far. In particular, many cost functions work based on the difference 
between the observed value and the expected one. A value that exists far outside 
the expected range will have a huge deviance from the expected value, leading the 
learning process to adapt to the anomaly abruptly by undervaluing the regular 
values (which instead produce comparatively slight deviations). 
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Not all machine learning algorithms are sensitive to outliers, though, and there 
are some cautionary steps that you can take when an algorithm is learning. For 
example, you can slowly update the parameters that the algorithm is learning so 
that it doesn’t immediately take for granted every encountered value, but instead 
only the most encountered values. It’s a good practice to deal with the unexpected 
before the learning training starts by adopting a good data audit style, even if you 
have provided remedies on the learning side. 


Manifest outliers skew data distributions and deeply affect your descriptive sta- 
tistics. The mean figure is usually so deviated in the presence of outliers that 
you’ll notice a large difference between the mean and the median values. In addi- 
tion, outliers affect relationships between variables and could weaken or 
strengthen an association between two variables, thus causing further confusion 
in your learning algorithm. 


You also have subtle outliers, those that don’t clearly affect your distributions but 
populate otherwise empty parts of the data space, forcing the learning algorithm 
to adapt to their presence. In reality, adaptation to such cases is equal to learning 
noisy information, thus increasing the variance of estimates. 


A good strategy for finding manifest outliers is to observe some key descriptive 
statistics and practice some exploratory data analysis (EDA), a term coined by John 
Tukey, an American statistician who wrote about the need for more integration 
between graphical analysis and descriptive measures in his 1977 book Exploratory 
Data Analysis. This is the univariate approach. For instance, as previous sections in 
this chapter discuss, a large difference between the mean and the median is a good 
signal that something is wrong with your data. You build the mean by summing 
all the numeric values in a feature, so high or low values enter directly into its 
computation. You build the median by ordering the feature’s values and then 
picking the middle one. (If you have a pair number of examples, you just take the 
average of the two middle observations.) Order is actually unaffected by high or 
low values at the extremities. Many other statistics rely on ordering and are called 
robust because of their resistance to anomalies. 


With regard to the problem of measuring the association between variables, the 
chapters that follow explain the correlation measure in detail. This measure is 
based on the mean and is affected by outlying observations. With respect to mea- 
sures of association, you also have a choice of robust association measures, such 
as Spearman’s ranking correlation, a good choice when you want to check whether 
the rankings of two measures correspond to one another. 


Another measure is the InterQuartile Range (IQR), which is based on the ordering 
of examples because it derives from the difference between the value of the fea- 
tures at the 75 percent of the ordering minus the value at the 25 percent. By cal- 
culating IQR, you get the value range that comprised the central 50 percent of your 
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FIGURE 13-1: 


A boxplot of the 
LSTAT feature 
from the Boston 
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dataset. 


examples. Empirically, adding to the upper boundary and removing from the 
lower boundary of the IQR, its value multiplied by 1.5 should give you a rough 
estimate of how far you can extend the feature’s range before figuring out that the 
values you are really dealing with are too high or too low. Such extension of the 
IQR measure is the basis of the boxplot, a visual representation devised by John 
Tukey (which is why some sources call it the Tukey boxplot). In a boxplot, you can 
find the median and then the IQR boundaries described as a box, followed by the 
extension of IQR*1.5 in both directions as a kind of whiskers. Finally, all examples 
outside the whiskers are plotted, which hints to you about suspected outliers. As 
an example, in Python you can use the method boxplot from a Pandas dataframe, 
but R has an analogous function called boxplot (see Figure 13-1 for the output 
from this example): 


from sklearn.datasets import load_boston 

import pandas as pd 

amatplotlib inline 

boston = load_boston( ) 

X, y = boston.data, boston.target 

X = pd.DataFrame(X, columns=boston. feature_names ) 
X.boxplot('LSTAT' ,return_type='axes' ) 





Plotting a boxplot of the LSTAT feature reveals how some values beyond the value 
of 30 are actually suspected of being outliers. Using IQR and boxplots, you can dis- 
cover most of the outliers, but unfortunately not all of them. Some can hide really 
well in your data. Another strategy based on visual discovery is to use scatterplots, 
which plot data in a bidimensional chart using your variables two at a time. 
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Plotting all combinations of your features can be very time consuming, though. 
A solution can be to create a matrix of scatterplots, but if you have too many vari- 
ables, the problem of the complexity of the representation won’t go away. In such 
a case, restructuring and compressing your data might provide an efficient way to 
reduce the complexity. Principal component analysis combines variables and in 
doing so removes redundancies and emphasis problems that touch more variables 
at one time. 


The first two or three components of a PCA contain the most information from 
your data. You can easily plot them (using just a few charts) and use the output to 
discover outliers. In addition, you shouldn’t neglect the last two components 
because they store the unique, residual information available in the data, and they 
surprisingly can reveal less evident outliers. For instance, in the following example 
using Python and the Boston dataset, you can spot only two outlying groups by 
observing the last two components. Figure 13-2 shows the output from the first 
two components and Figure 13-3 shows the output from the last two components. 


from sklearn.decomposition import PCA 
from sklearn.preprocessing import scale 
pca = PCA() 

pea. fit(scale(X) ) 

C = pca.transform(scale(X) ) 


import matplotlib.pyplot as plt 

plt.scatter(C[:,0],C[:,1], s=2**7, edgecolors='white', 
alpha=0.85, cmap='autumn' ) 

plt.grid() # adds a grid 

plt.xlabel('Component 1') # adds label to x axis 

plt.ylabel('Component 2') # adds label to y axis 


Component 2 


FIGURE 13-2: 

A scatterplot of 
the first two 
components of a 
PCA of the Boston 
dataset. Component 1 
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plt.scatter(C[:,11],C[:,12], s=2**7, edgecolors='white', 
alpha=0.85, cmap='autumn' ) 

plt.grid() # adds a grid 

plt.xlabel('Component 12') # adds label to x axis 

plt.ylabel('Component 13') # adds label to y axis 
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FIGURE 13-3: 
A scatterplot of 
the last two 
components of a 
PCA of the Boston ‘=1.5 -1.0 -0.5 0.0 0.5 10 15 
dataset. Component 12 
An alternative solution to PCA is to use clustering techniques, such as K-means. 
You find this technique discussed in the next chapter. 
TIP 
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IN THIS CHAPTER 


Understanding differences between 
examples 


Clustering data into meaningful 
groups 


Classifying and regressing after 
looking for data neighbors 


Grasping the difficulties of working 
in a high-dimensional data space 


Leveraging Similarity 


rose is a rose. A tree is a tree. A car is a car. Even though you can make 

simple statements like this, one example of each kind of item doesn’t suf- 

fice to identify all the items that fit into that classification. After all, many 
species of trees and many kinds of roses exist. If you evaluate the problem under 
a machine learning framework in the examples, you find features whose values 
change frequently and features that somehow systematically persist (a tree is 
always made of wood and has a trunk and roots, for instance). When you look 
closely for the features’ values that repeat constantly, you can guess that certain 
observed objects are of much the same kind. 


So, children can figure out by themselves what cars are by looking at the features. 
After all, cars all have four wheels and run on roads. But what happens when a 
child sees a bus or a truck? Luckily, someone is there to explain the big cars and 
open the child’s world to larger definitions. In this chapter, you explore how 
machines can learn by exploiting similarity in 


>> A supervised way: Learning from previous examples. For example, a car has 
four wheels, therefore if a new object has four wheels, it could be a car. 


>> An unsupervised way: Inferring a grouping without any label to learn from. 
For example, a list of items all have roots and are made of wood, so they 
should go into the same group even though they lack a name. 
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Both our topic algorithms, K-means, an unsupervised clustering algorithm, and 
K-Nearest Neighbors, a supervised regression and classification algorithm, work by 
leveraging similarities among examples. They offer a good glance at the advantages 
and disadvantages of ordering the world in items that are more or less similar. 


Measuring Similarity between Vectors 


FIGURE 14-1: 
Examples of 
values plotted as 
points on a chart. 


You can easily compare examples from your data using calculations if you think of 
each of them as a vector. The following sections describe how to measure similar- 
ity between vectors to perform tasks such as computing the distance between 
vectors for learning purposes. 


Understanding similarity 


In a vector form, you can see each variable in your examples as a series of coordi- 
nates, with each one pointing to a position in a different space dimension, as 
shown in Figure 14-1. If a vector has two elements, that is, it has just two vari- 
ables, working with it is just like checking an item’s position on a map by using 
the first number for the position in the East-West axis and the second on the 
North-South axis. 





0.5 10 15 2.0 25 3.0 3.5 


For instance, the numbers between parentheses (1,2) (3,2), and (3,3) are all exam- 
ples of points. Each example is an ordered list of values (called a tuple) that can be 
easily located and printed on a map using the first value of the list for x (the hori- 
zontal axis) and the second for y (the vertical axis). The result is a scatterplot, and 
you find examples of them in this chapter and the entire book. 
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If your dataset, in matrix form, has many numeric features (the columns), ideally 
the number of the features represents the dimensions of the data space, while the 
rows (the examples) represent each point, which mathematically is a vector. When 
your vector has more than two elements, visualization becomes troublesome 
because representing dimensionalities above the third isn’t easy (after all, we live 
in a three-dimensional world). However, you can strive to convey more dimen- 
sionalities by some expedient, such as by using size, shape, or color for other 
dimensions. Clearly, that’s not an easy task, and often the result is far from being 
intuitive. However, you can grasp the idea of where the points would be in your 
data space by systematically printing many graphs while considering the dimen- 
sions two by two. Such plots are called matrices of scatterplots. 


Don’t worry about multidimensionality. You extend the rules you learned in two 
or three dimensions to multiple dimensions, so if a rule works in a bidimensional 
space, it also works in a multiple one. Therefore, all the examples first refer to 
bidimensional examples. 


Computing distances for learning 


An algorithm can learn by using vectors of numbers that use distance measure- 
ments. Often the space implied by your vectors is a metric one that is a space 
whose distances respect certain specific conditions: 


>> No negative distances exist, and your distance is zero only when the starting 
point and ending point coincide (called nonnegativity) 


>» The distance is the same going from a point to another and vice versa (called 
symmetry) 


>» The distance between an initial point and a final one is always greater than, or 
at worse the same as, the distance going from the initial to a third point and 
from there to the final one (called triangle inequality — which means that there 
are no shortcuts) 


Distances that measure a metric space are the Euclidean distance, the Manhattan 
distance, and the Chebyshev distance. These are all distances that can apply to 
numeric vectors. 


Euclidean distance 


The most common is the Euclidean distance, also described as the 12 norm of two 
vectors (you can find a discussion of l1, 12, and linf norms at https: //rorasa. 
wordpress. com/2012/@5/13/1@-norm-11-norm-12-norm-1-infinity-—norm/). 
In a bidimensional plane, the Euclidean distance refigures as the straight line 
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connecting two points, and you calculate it as the square root of the sum of the 
squared difference between the elements of two vectors. In the previous plot, 
the Euclidean distance between points (1,2) and (3,3) can be computed in R as 
sqrt((1-3)A2+(2-3)A2), which results in a distance of about 2.236. 


Manhattan distance 


Another useful measure is the Manhattan distance (also described as the l1 norm 
of two vectors). You calculate the Manhattan distance by summing the absolute 
value of the difference between the elements of the vectors. If the Euclidean dis- 
tance marks the shortest route, the Manhattan distance marks the longest route, 
resembling the directions of a taxi moving in a city. (The distance is also known 
as taxicab or city-block distance.) For instance, the Manhattan distance between 
points (1,2) and (3,3) is abs(1—3) and abs(2-3), which results in 3. 


Chebyshev distance 


The Chebyshev distance or maximum metric takes the maximum of the absolute 
difference between the elements of the vectors. It is a distance measure that can 
represent how a king moves in the game of chess or, in warehouse logistics, the 
operations required by an overhead crane to move a crate from one place to 
another. In machine learning, the Chebyshev distance can prove useful when you 
have many dimensions to consider and most of them are just irrelevant or redun- 
dant (in Chebyshev, you just pick the one whose absolute difference is the largest). 
In the example used in previous sections, the distance is simply 2, the max 
between (1-3) and abs(2-3). 


Using Distances to Locate Clusters 
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Working with a well-ordered space, you naturally find similar items next to each 
other, such as books about the same topic in a library. In a library, similar books 
stand in the same bookshelf, in the same bookcase, and in the same section. 
Imagine, for instance, being a librarian who is tasked with gathering all the books 
on the same topic without any helpful indication of a preexisting index or label. 
Grouping similar objects or examples in this case is clustering. In machine learn- 
ing, it is an unsupervised task. It allows you to create labels when no labeling is 
available, or when creating new labeling empirically is helpful. 


With the library example, a good solution to the lack of an index or labels would 
be picking books here and there at random, each one located in a different book- 
case and bookshelf, and then looking for similar books in all directions. You could 
go as far from the original book as makes sense. In a short time, based on the 
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books’ locations, you could partition your library into homogeneous areas of 
books around similar topics. After reading a single representative book from each 
area, you could easily and confidently label all the books located there by topic. 


Based on this same idea (starting from an example and looking in all directions 
within a given range), an entire series of statistical algorithms, called partition 
algorithms, help you explore data dimensions around starting examples by aggre- 
gating the ones that are similar. Among partition algorithms, K-means is the 
most well known and popular. It usually works out good solutions by leveraging 
the nearness of similar examples in a data space, drawing the boundaries of 
classes, and ultimately recovering any unknown group structure. K means allows 
labeling, summarization, and sometimes a deeper understanding of hidden 
dynamics of data. K-means can help you achieve the following: 


>> Labeling examples into separated groups 


>» Creating new features (the labels of the groups) for use in supervised learning 
tasks (labels from a cluster analysis are very helpful as new features when 
learning from text and images) 


>> Grouping anomalous examples into groups of their own, thus helping you to 
locate them easily 


K-means is not the only algorithm capable of performing clustering tasks. Clus- 
tering (also known as cluster analysis) has a long history, and algorithms of dif- 
ferent kinds exist for performing it. There are clustering methods that arrange the 
examples into tree-like structures (hierarchical clustering) and others that look 
for parts of the data space where examples are denser (DBScan). Others figure out 
whether any cluster is derivable from certain statistical distributions, such as a 
Gaussian. Among so many choices, K-means has become a very successful algo- 
rithm in machine learning for good reasons: 


>» It's easy and intuitive to understand. 
>> It can be fast and scales nicely to large amounts of data. 


>> It does not require keeping too much information in memory. 


>> Its output is useful as an input to other machine learning algorithms. 


Checking assumptions and expectations 


K-means relies on assumptions that some people dispute and that you need to 
know about. First, the algorithm assumes that your data has groups called clusters. 
It also assumes that the groups are made of similar examples, with the starting 
example, called the prototype or the centroid, in the center of the cluster. Finally, it 
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assumes that, in the data space, groups have a rigorously spherical-like shape. 
Therefore, regretfully, K-means doesn’t allow strange shapes, which can be a 
weakness of the technique because the real world isn’t always geometrically 
shaped. These are all theoretical assumptions. (The algorithm can work well when 
the data meets the conditions; otherwise, you must check the result closely.) 


K-means works with a numeric measure, the Euclidean distance. All your data has 
to be in the form of a number representing a measure (technically called a metric 
measure; for instance, everyday metrics measures are meters and kilos). You can’t 
use variables whose values are assigned arbitrarily; the measure should have some 
relation to reality. However, even though it’s not ideal, you can use ordinal num- 
bers (like 15t, 2™4, 3*4, and so on, because they have a measure-like order). You can 
also use binary variables (1/0), as described in the previous chapter. 


The Euclidean distance is the root of a big sum, so your variables have to be all of 
the same scale or the variables with the larger range will dominate the distance 
(and you’ll create clusters on just those variables). The same domination also 
occurs if some variables are correlated, that is, they share a part of their informa- 
tive content (variance). Again, you have some variables influencing the result 
more than others do. One solution is to transform your data before the K-means 
by statistically standardizing all the variables and transforming them into com- 
ponents by a dimensionality reduction algorithm such as principal component 
analysis (PCA), as described in the previous chapter. 


K-means also expects you to already know how many clusters your data contains. 
However, it isn’t a big problem if you don’t know, because you can guess or try 
many solutions, starting from the desirable ones. Because the algorithm makes so 
many theoretical and practical assumptions, it always comes up with a solution 
(which is why everyone loves it so much). When you work with data with no clusters 
or ask for the wrong number of clusters, it can provide you with some misleading 
results. You can distinguish good results from bad ones based on the following: 


>> Heuristics: You can measure the quality of the clustering. 
>> Reproducibility: Random results cannot be replicated. 
>> Understandability: Absurd solutions are seldom real solutions. 


>> Usability: You care about how machine learning practically solves problems 
and aren't concerned about its correctness in terms of assumptions. 


The K-means algorithm is an unsupervised algorithm, so unless you know the 
cluster solution beforehand, you don’t have any error to measure in terms of devi- 
ance or accuracy. After getting a solution, always do a reality check with the clus- 
tering quality measures to see whether the result is reproducible under different 
conditions, makes sense, and can help with your problem. 
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Inspecting the gears of the algorithm 


The K-means algorithm performs tasks in a specific way. By understanding the 
procedure that the algorithm uses to perform tasks, you can better understand 
how to employ the K-means algorithm. You can detail the procedure that the 
algorithm uses in a few steps: 


1. After you instruct the algorithm that there are k clusters in the data (where k is 
an integer number), the algorithm picks k random examples as the original 
centroids of your k clusters. 


2. The algorithm assigns all the examples to each of the k clusters based on their 
Euclidian distance to each one's centroid. The nearest centroid wins the 
example, which then becomes part of its cluster. 


3. After assigning all the examples to the clusters, the algorithm recalculates the 
new centroid of each one by averaging all the examples that are part of the 
group. After the first round, the new centroids likely won't coincide with a real 
example anymore. At this point, when thinking of centroids, consider them as 
ideal examples (actually, prototypes). 


4. fitis not the first round, after averaging, the algorithm checks how much the 
position of centroids has changed in the data space. If it has not changed all 
that much from the previous round, the algorithm assumes a stable solution 
and returns the solution to you. Otherwise, the algorithm repeats Steps 2 
and 3. Having changed the position of centroids, the algorithm reassigns part 
of the examples to a different cluster, which likely leads to a change in the 
centroid position. 


Given the jumps between Steps 4 and 2 until the output meets a certain conver- 
gence condition, the K-means algorithm is iterative. Iteration after iteration, the 
initial centroids, which the algorithm chooses randomly, move their position until 
the algorithm finds a stable solution. (The examples don’t move anymore between 
clusters, or at least few do.) At that point, after the algorithm has converged, you 
can expect that 


>» All your data is separated into clusters (so each example will have one and just 
one cluster label). 


>? All the clusters tend to have the maximum internal cohesion possible. You can 
compute the cohesion for every cluster by subtracting the centroid position 
from the position of each example, then squaring the result of each subtrac- 
tion (so you remove the sign), and finally summing all the results. Thus, you 
obtain the cohesion, which in a cluster analysis always tends to be the 
minimum possible (called within-cluster sum of squares or WSS). 
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>> All the clusters have maximum external difference possible. This means that if 
you take the difference of each centroid with the average of the data space 
(the grand centroid), square each difference, multiply each of them by their 
respective cluster’s number of examples, and then sum all results together, 
the result is the maximum possible (between-cluster sum of squares or BSS). 


Because the between-cluster sum of squares is dependent on the result of the 
within-cluster calculation, you need to look at only one of them (usually WSS will 
suffice). Sometimes the starting position is an unlucky one, and the algorithm 
doesn’t converge at a proper solution. The data is always partitioned, so you can 
only guess that the algorithm has performed the work acceptably by calculating 
the within-cluster sum of squares of the solution and comparing it with previous 
calculations. 


If you run the K-means a few times and record the results, you can easily spot 
algorithm runs that had a higher within-cluster sum of squares and a lower 
between-cluster result as the solution that you can’t trust. Depending on your 
computational power at hand and the dataset size, running many trials can con- 
sume a lot of time, and you have to decide how to trade time for safety in choosing 
a good solution. 


Tuning the K-Means Algorithm 
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To get the best possible results from the K-means algorithm, you must tune it. 
Tuning a K-means algorithm requires clear ideas about its purpose: 


>> If the purpose is explorative, stop at the number of clusters when the 
solution makes sense and you can determine which clusters to use by 
naming them. 


>> If you are working with abstract data, looking at the within-cluster sum of 
squares or at some other tuning measure can help hint at the right solution. 


>> If you need the cluster results to feed a supervised algorithm, use 
cross-validation to determine the solution that brings more predictive power. 


The next step requires you to decide on an implementation. The Python language 
offers two versions of the algorithm in the Scikit-learn package. The first one is 
the classical algorithm, sklearn.cluster.KMeans. You can also use a mini-batch 
version, sklearn.cluster.MiniBatchKMeans, which differs from the standard 
K-means because it can compute new centroids and reassign all previous cluster 
labels on portions of the data (instead of making all the computations after evalu- 
ating all the data sample). 
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The advantage of the mini-batch is that it can process data that won’t fit in the 
available memory of your computer by fetching examples in small chunks from 
disk. The algorithm processes each chunk, updates the clusters, and then loads 
the next chunk. The only possible bottleneck is the speed of data transfer. The 
process takes more time than the classical algorithm, but when it’s finished com- 
puting (and it may take a few passages on all your data), you’ll have a complete 
model that’s not much different from the model you could have obtained using 
the standard algorithm. 


sklearn.cluster .MiniBatchKMeans has two fit methods: 


>> fit: Works with data in memory and stops after it processes the information 
available based on the batch size that you set using the batch_size 
parameter. 


>» partial_fit: Processes the data in memory, but then remains open to start 
again when presented with new data, so it’s perfect for streaming data in 
blocks from disk or from a network such as the Internet. 


sklearn.cluster.KMeans offers all the standard parameters discussed earlier: 
the number of clusters (n_clusters) and the initialization method (init). More- 
over, it offers the possibility to precompute distances (precompute_distances). If 
the number of effective iterations is high, sklearn.cluster.KMeans calculates 
the distances repetitively, wasting time. If you have memory and need speed, just 
set precompute_distances to TRUE, which stores all the calculations in advance. 
You can also instruct the algorithm to work in parallel (setting n_ job to -1) when 
it has to create many solutions simultaneously (because of different random ini- 
tializations). Precomputing distances and parallel computation make the Python 
implementation currently the fastest available. 


Experimenting K-means reliability 


The book’s first experiment to demonstrate K-means nuances uses Python and 
the Iris dataset, a popular example dataset about three species of iris flowers that 
had their petals and sepals (part of the flower, supporting the petals when in 
bloom) measured. Introduced by the statistician Ronald Fisher in one of his papers 
in 1936 to demonstrate linear discriminant analysis (a statistical predictive analy- 
sis), the tricky thing about this dataset is that two iris species (Virginica and Ver- 
sicolor) need combined measurements to aid in distinguishing one from the other 
(in supervised learning). You can’t resolve labeling in unsupervised learning by 
using just the given information. The Iris dataset is also a balanced dataset because 
you use the same number of examples for each of the three iris species, as shown 
in the following Python example. 
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from sklearn.datasets import load_iris 
data = load_iris() 

print ("Features :%s" % data. feature_names) 
features = data.data 

labels = data.target 


Features :['sepal length (cm)', 'sepal width (cm)', 
"petal length (cm)', ‘petal width (cm)'] 


The experiment uses both of the two available K-means versions in Scikit-learn: 
the standard algorithm and the mini-batch version. In the Scikit-learn package, 
you must define a variable for each learning algorithm in advance, specifying its 
parameters. Therefore, you define two variables, k_means and mb_k_means, which 
requires three clusters, a smart initialization procedure (called ‘k-means++’ ), and 
raising the number of maximum iterations to a high figure (don’t worry, usually 
the algorithm is quite fast). Finally, you fit the features, and after a short delay, 
the computations complete. 


from sklearn.cluster import MiniBatchKMeans, KMeans 
k_means = KMeans(n_clusters=3, init='k-means++', 
max_iter=999, n_init=1, random_state=1@1 ) 
mb_k_means = MiniBatchKMeans(n_clusters=3, init='k-means++', 
max_iter=999, batch_size=10, n_init=1, random_state=101 ) 


k_means. fit( features) 
mb_k_means. fit( features) 


The following code prints a nice plot on your screen of how points (our examples 
of flowers) distribute on a map made of sepal length and width, as shown in 
Figure 14-2. If you’re working with IPython Notebook, %matplotlib inline dis- 
plays the chart inside your notebook. 


amatplotlib inline 
import matplotlib.pyplot as plt 
plt.scatter(features[:,@], features[:,1], s=2**7, c=labels, 
edgecolors='white', alpha=0.85, cmap='autumn' ) 
plt.grid() # adds a grid 
plt.xlabel(data.feature_names[@]) # adds label to x axis 
plt.ylabel(data.feature_names[1]) # adds label to y axis 
# Printing centroids, first of regular K-means, then of mini- 
batch 
plt.scatter(k_means.cluster_centers_[:,@], k_means.cluster_ 
centers_[:,1], 
s=2xx6, marker='s', c='white') 
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FIGURE 14-2: 
Clusters of iris 
species plotted 
on a chart based 
on sepal length 
and width. 


plt.scatter(mb_k_means.cluster_centers_[:,@], 
mb_k_means.cluster_centers_[:,1], s=2**8, 
marker='*', c='white' ) 
for class_no in range(@,3): # We just annotate a point 
for each class 
plt.annotate(data.target_names[class_no], 
(features [3+50*class_no,@] , features[3+50*class_no,1])) 
plt.show() # Showing the result 


sepal width (cm) 





40 45 5.0 5.5 6.0 65 7.0 75 8.0 85 
sepal length (cm) 


What is noticeable about the plot is that it also displays the centroids — those 
from the standard algorithm as squares; those from the mini-batch version as 
stars — and they don’t differ all that much. This fact demonstrates how the dif- 
ferent learning procedures lead to almost identical conclusions. Sometimes it’s 
amazing to see how even algorithms that are different may arrive at the same 
conclusions. When that doesn’t happen, you might have too much variance in the 
estimates, and each algorithm could have a very different learning strategy; yet, 
you’ll notice that machine learning algorithms often tend to get the same strong 
signals (though in the end they make different use of them). 


All the information about the algorithm is now stored in the variables. For 
instance, by typing k_means.cluster_centers_, you can get all the coordinates 
of the centroids elaborated by the K-means procedure. 


Experimenting with how centroids converge 


Though you can now compute the result and know that the different versions and 
runs of K-means tend to arrive at similar solutions, you still need to grasp how 
you reach the result. In this section’s experiment, you follow how a centroid 
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(the second one) changes along iterations by visualizing its vector of coordinates 
step by step along the optimization. 


import numpy as np 
np.set_printoptions(precision=3, suppress=True) # sets 
output 3 dec points 
for iteration in range(1, 10): 
k_means = KMeans(n_clusters=3, init='random', 
max_iter=iteration, n_init=1, 
random_state=101 ) 
k_means. fit( features) 
print ("Iteration: %i - 2nd centroid: %s" % 
(iteration, k_means.cluster_centers_[1])) 


Iteration: 1 - 2nd centroid: 9.362 3.763 1.512 0.275] 
Iteration: 2 - 2nd centroid: 4.959 3.352 1.47 Q.246] 
Iteration: 3 - 2nd centroid: 4.914 3.268 1.539 @.275| 
Iteration: 4 -— 2nd centroid: 4.878 3.188 1.58 0.295] 
Iteration: 5 - 2nd centroid: 4.833 3.153 1.583 0.294] 
Iteration: 6 - 2nd centroid: 4.8 3.109 1.606 0.303] 
Iteration: 7 - 2nd centroid: 4.783 3.087 1.62 0.307] 
Iteration: 8 - 2nd centroid: 4.776 3.072 1.621 Q.297] 
Iteration: 9 - 2nd centroid: 4.776 3.072 1.621 0.297] 











Observing the adjusting values as iterations proceed, the rate of change dimin- 
ishes until later iterations, when the change from each passage is so small that 
you can’t see it without using many decimal places. 


The clustering module in Scikit-learn contains all the presented versions of 
K-means plus other clustering algorithms. You can find it at http: //scikit- 
learn.org/stable/modules/classes .html#module-sklearn.cluster. In addi- 
tion, R has a rich and efficient implementation of K-means. You can find it in the 
library stats (https: //stat.ethz.ch/R-manual /R-devel/library/stats/html/ 
kmeans .htm1). 


The next experiment, which uses R, aims to check how good the K-means algo- 
rithm is at guessing the real clusters in the Iris dataset. This experiment is a chal- 
lenging task because various studies say that it isn’t possible to achieve it 
accurately (see https://en.wikipedia.org/wiki/Iris_flower_data_set for 
details). 


# We call the libraries 


library(datasets ) 
library(class) 
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# We divide our dataset into answer and features 
answer <- iris[,5] 

features <- iris[,1:4] 

X <- princomp( features )$scores 


clustering <- kmeans(x=X, centers=3, iter.max = 999, nstart = 10, 
algorithm = "Hartigan-Wong" ) 


print (clustering$tot.withinss) 
table(answer, clustering$cluster ) 


answer d@ 8 
setosa @50 @ 
versicolor 2 @ 48 
virginica 36 @ 14 


Working with the Iris dataset allows tests on both real data using a ground truth 
(which is another way of saying that the examples are labelled). This example first 
creates a PCA solution and then computes a three-cluster solution. The PCA 
removes the correlations and standardizes the variables at the same time, so you 
can be confident that the Euclidean distance measurement that K-means com- 
putes works well. Finally, you print a confusion matrix, a matrix representation of 
where the correct answers are in the rows and the predicted ones are in columns. 
Glancing at the confusion matrix, you notice that 14 flowers of the Virginica spe- 
cies are classified as Versicolor. K-means wasn’t able to recover the correct natural 
classes at this time. Usually, increasing the number of clusters solves such prob- 
lems, although doing so may generate many clusters that differ only slightly from 
each other. From a machine learning point of view, having more clusters is not all 
that much fuss, but for humans, it may render understanding more complicated. 
This experiment tests different clustering solutions in an iterative loop with the 
output shown in Figure 14-3. 


w <- rep(@,1@) 
for (h in 1:10) { 
clustering <- kmeans(x=X, centers=h, iter.max = 999, 
nstart = 10, 
algorithm = "Hartigan-Wong" ) 
w[h] <- clustering$tot.withinss 


} 


plot(w, type='o') 
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FIGURE 14-3: 
Plot of a 
within-cluster 
sum of squares 
regarding 
different cluster 
solutions. 


FIGURE 14-4: 
Iris species 
represented by 
eight clusters. 
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clustering <- kmeans(x=X, centers=8, iter.max = 999, nstart = 10, 
algorithm = "Hartigan-Wong" ) 


table(answer, clustering$cluster ) 

answer da 8 #&@ ® © YY 
setosa 2228 0 0 © 0 0 Q 
versicolor © © 320 © 918 ð 
virginica @ @15 122 0 012 


plot(X[,c(1,2)], col = clustering$cluster) 
points(clustering$centers[,c(1,2)], col = 1:8, 
pen = is, Cae = lA) 


A loop over many solutions reveals that after a four-cluster solution, improve- 
ments are slight. Such is a quite common (and disappointing) situation, and the 
best heuristic is to see when the WSS curve flattens out. In this case, the right 
solution is eight clusters because you can test it and visually confirm that this 
solution is the best in separating classes, as shown in Figure 14-4. 
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When working with real data, you rarely have a ground truth to check. Trust your 
intuition and the cluster quality measures if you can’t rely on a test. If you’re not 
satisfied with the solutions, look for ways to add and remove one or more of your 
REMEMBER features. 





Searching for Classification by 
K-Nearest Neighbors 


No matter if the problem is to guess a number or a class, the idea behind the 
learning strategy of the K-Nearest Neighbors (KNN) algorithm is always the same. 
The algorithm finds the most similar observations to the one you have to predict 
and from which you derive a good intuition of the possible answer by averaging 
the neighboring values, or by picking the most frequent answer class among them. 


The learning strategy in a KNN is more like memorization. It’s just like remem- 
bering what the answer should be when the question has certain characteristics 
(based on circumstances or past examples) rather than really knowing the answer, 
because you understand the question by means of specific classification rules. In 
a sense, KNN is often defined as a lazy algorithm because no real learning is done 
at the training time, just data recording. 


Being a lazy algorithm implies that KNN is quite fast at training but very slow at 
predicting. (Most of the searching activities and calculations on the neighbors is 
done at that time.) It also implies that the algorithm is quite memory intensive 
because you have to store your dataset in memory (which means that there’s a 
limit to possible applications when dealing with big data). Ideally, KNN can make 
the difference when you’re working on classification and you have many labels to 
deal with (for instance, when a software agent posts a tag on a social network or 
when proposing a selling recommendation). KNN can easily deal with hundreds of 
labels, whereas other learning algorithms have to specify a different model for 
each label. 


Usually, KNN works out the neighbors of an observation after using a measure of 
distance such as Euclidean (the most common choice) or Manhattan (works better 
when you have many redundant features in your data). No absolute rules exist 
concerning what distance measure is best to use. It really depends on the imple- 
mentation you have. You also have to test each distance as a distinct hypothesis 
and verify by cross-validation as to which measure works better with the problem 
you’re solving. 
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Leveraging the Correct K Parameter 


The k parameter is the one you can work on tuning to make a KNN algorithm per- 
form well in prediction and regression. The following sections describe how to use 


the k parameter to tune the KNN algorithm. 


Understanding the k parameter 


The k value, an integer number, is the number of neighbors that the algorithm has 
to consider in order to figure out an answer. The smaller the k parameter, the 
more the algorithm will adapt to the data you are presenting, risking overfitting 
but nicely fitting complex separating boundaries between classes. The larger the k 
parameter, the more it abstracts from the ups and downs of real data, which 
derives nicely smoothed curves between classes in data, but does so at the expense 


of accounting for irrelevant examples. 


As a rule of thumb, first try the nearest integer of the square root of the number 
of examples available as a k parameter in KNN. For instance, if you have 1,000 
examples, start with k = 31 and then decrease the value in a grid search backed up 


TIP by cross-validation. 


Using irrelevant or unsuitable examples is a risk that a KNN algorithm takes as the 
number of examples it uses for its distance function increases. The previous illus- 
tration of the problem of data dimensions shows how to compute a well-ordered 
data space as a library in which you could look for similar books in the same book- 
shelf, bookcase, and section. However, things won’t look so easy when the library 
has more than one floor. At that point, books upstairs and downstairs are not 
necessarily similar; therefore, being near but on a different floor won’t assure that 
the books are similar. Adding more dimensions weakens the role of useful ones, 


but that is just the beginning of your trouble. 


Now imagine having more than the three dimensions in daily life (four if you con- 
sider time). The more dimensions, the more space you gain in your library. (As in 
geometry, you multiply dimensions to get an idea of the volume.) At a certain 
point, you will have so much space that your books will fit easily with space left 
over. For instance, if you have 20 binary variables representing your library, you 
could have 2 raised to the 20‘ power combinations; that is, 1,048,576 possible 
different bookcases. It’s great to have a million bookcases, but if you don’t have a 
million books to fill them, most of your library will be empty. So you obtain a book 
and then look for similar books to place it with. All your nearby bookcases are 
actually empty, so you have to go far before finding another nonempty bookcase. 
Think about it: You start with The Hitchhiker’s Guide to the Galaxy and end up having 
a book on gardening as its nearest neighbor. This is the curse of dimensionality. 
The more dimensions, the more likely you are to experience some false similarity, 


misunderstanding far for near. 
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TIP 


Using the right-sized k parameters alleviates the problem because the more 
neighbors you have to find, the further KNN has to look — but you have other 
remedies. Principal component analysis (PCA) can compress the space, making it 
denser and removing noise and irrelevant, redundant information. In addition, 
feature selection can do the trick, selecting only the features that can really help 
KNN find the right neighbors. 


As explained in Chapter 11 about validating machine learning tasks, a stepwise 
selection, checked by cross-validation, can make a KNN work well because it 
keeps only the features that are truly functional for the task. 


KNN is an algorithm that’s sensitive to outliers. Neighbors on the boundaries of 
your data cloud in the data space could be outlying examples, causing your predic- 
tions to become erratic. You need to clean your data before using it. Running a 
K-means first can help you identify outliers gathered into groups of their own. 
(Outliers love to stay in separate groups; you can view them as the hermit types in 
your data.) Also, keeping your neighborhood large can help you minimize (but 
sometimes not avoid completely) the problem at the expense of a lower fit to the 
data (more bias than overfitting). 


Experimenting with a flexible algorithm 


The KNN algorithm has slightly different implementations in R and Python. In R, 
the algorithm is found in the library class. The function is just for classification 
and uses only the Euclidean distance for locating neighbors. It does offer a conve- 
nient version with automatic cross-validation for discovering the best k value. 
There’s also another R library, FNN (https://cran.r-project.org/web/ 
packages/FNN/index.htm1), which contains one KNN variant for classification 
and another for regression problems. The peculiarity of the FNN functions is that 
they can deal with the complexity of distance computations using different algo- 
rithms, but the Euclidean distance is the only distance available. 


The following R code experiment uses the cross-validated KNN from the library 
class. It looks for the best k value by cross-validation and then tests it on an out- 
of-sample portion of the data. Learning the correct hyper-parameters using 
cross-validation guarantees that you find the best value, not just for the single 
analyzed data but also for any other possible data arriving from the same source. 
Testing using out-of-sample data offers a realistic evaluation of the accuracy of 
the learned model because the data was never used for any setting during the 
learning phase. 


set.seed(seed=101 ) 
out_of_sample <- sample(x=length(answer ) , 25) 
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# in a loop we try values of k ranging from 1 to 15 
for (in im dedS) 4 


in_sample_pred <- knn.cv(train=features[-out_of_sample, |, 
cl=answer [-out_of_sample], 
k =h, 1 = @, prob = FALSE, use.all = 
TRUE) 
# After getting the cross-validated predictions, 
# we calculate the accuracy 
accuracy <- sum(answer[-out_of_sample]==in_sample_pred) / 
length( answer [-out_of_sample] ) 
# We print the result 
print (paste("for k=",h," accuracy is:",accuracy) ) 


} 

4] "for k= 1 accuracy is: 9.952" 
1] "for k= 2 accuracy is: 0.968" 
4] "for k= 3 accuracy is: 0.96" 
4] "for k= 4 accuracy is: 0.96" 
4] "for k= 5 accuracy is: 9.952" 
4] "for k= 6 accuracy is: 0.952" 
4] "for k= 7 accuracy is: 0.968" 
4] "for k= 8 accuracy is: 0.968" 
4] "for k= 9 accuracy is: 0.968" 


1] "for k= 10 accuracy is: 0.968" 
4] "for k= 11 accuracy is: 0.976" 
4] "for k= 12 accuracy is: 0.968" 
4] "for k= 13 accuracy is: 0.968" 
1] "for k= 14 accuracy is: 0.968" 
4] "for k= 15 accuracy is: 0.96" 





out_sample_pred <- knn(train=features[-out_of_sample, ], 
test=features[out_of_sample, ], 
cl=answer [-out_of_sample], k = 11, 
(OP probe — mL U Erm sercile lal Ep) 


print (table(answer[out_of_sample], out_sample_pred) ) 
out_sample_pred 
setosa versicolor virginica 
setosa versicolor virginica 


setosa T @ 
versicolor @ 10 1 
virginica @ (2) 
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The cross-validated search indicates that setting k to value 11 scores the best 
accuracy. The example then predicts a result using the untouched test set and 
verifies the results using a confusion matrix by cross-validating on rows contain- 
ing real values against estimated values on columns. The performance is high, as 
expected, with just an Iris Versicolor example misclassified as an Iris Virginica. 


The second experiment with KNN uses the Python class from Scikit-learn and 
demonstrates how such a simple algorithm is quite apt at learning shapes and 
nonlinear arrangements of examples in the data space. The block of code prepares 
a tricky dataset: In two dimensions, two classes are arranged in bull’s-eye con- 
centric circles, as shown in Figure 14-5. 


import numpy as np 
from sklearn.datasets import make_circles, make_blobs 


strange_data = make_circles(n_samples=500, shuffle=True, 
noise=0.15, random_state=101, 
factor=0.5) 


center = make_blobs(n_samples=1@0, n_features=2, 
centers=1, cluster_std=0.1, 
center_box=(@, @)) 
first_half = np.row_stack((strange_data[@][:250,:], 
center [@][:5@,:])) 
first_labels = np.append(strange_data[1i][:250], 
np.array([@]*5@) 
np.row_stack((strange_data[@][25@:,:], 
center [@] [50:,:])) 
second_labels = np.append(strange_data[1][250:], 
np.array( [Q]*5Q) 


second_half 


I 








amatplotlib inline 
import matplotlib.pyplot as plt 
plt.scatter(first_half[:,@], first_half[:,1], s=2**7, 
c=first_labels, edgecolors='white', 
alpha=0.85, cmap='winter' ) 
plt.grid() # adds a grid 
plt.show() # Showing the result 


After having built the dataset, you can test the experiment by setting the classifi- 
cation algorithm to learn the patterns in data after fixing a neighborhood of 3 and 
set weights to be uniform (Scikit-learn allows you to weight less distant observa- 
tions when it’s time to average them or pick the most frequent observations), and 
the Euclidean distance as metric. Scikit-learn algorithms, in fact, allow you to 
both regress and classify using different metrics, such as Euclidean, Manhattan, 
or Chebyshev, as shown in this Python code. 
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from sklearn.neighbors import KNeighborsClassi fier 

from sklearn.metrics import accuracy_score 

kNN = KNeighborsClassi fier(n_neighbors=3, 
weights='uniform', 
algorithm='auto', 
metric='euclidean' ) 

kNN. fit(first_half, first_labels) 

print ("Learning accuracy score:%0.3f" % 

accuracy_score(y_true=second_labels, 
y_pred=kNN.predict(second_half))) 


Learning accuracy score:0.937 
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IN THIS CHAPTER 


Illustrating linear models for 
regression and classification 


Preparing the right data for linear 
models 


Limiting the influence of less useful 
and redundant variables 


Selecting effective feature subsets 


Learning from big data by stochastic 
gradient descent 


Working with 
Linear Models 
the Easy Way 


ou can’t complete an overview of basic machine learning algorithms with- 

out discovering the linear models family, a common and excellent type of 

starting algorithm to use when trying to make predictions from data. Linear 
models comprise a wide family of models derived from statistical science, although 
just two of them, linear regression and logistic regression, are frequently men- 
tioned and used. 


Statisticians, econometricians, and scientists from many disciplines have long 
used linear models to confirm their theories by means of data validation and to 
obtain practical predictions. You can find an impressive number of books and 
papers about this family of models stocked in libraries and universities. This mass 
of literature discusses many applications as well as the sophisticated tests and 
statistical measures devised to check and validate the applicability of linear mod- 
els to many types of data problems in detail. 


CHAPTER 15 vith Lit s the Ea: 257 


Machine learning adherents adopted linear models early. However, because learn- 
ing from data is such a practical discipline, machine learning separates linear 
models from everything related to statistics and keeps only the mathematical 
formulations. Statistics powers the models using stochastic descent — the opti- 
mization procedure discussed in Chapter 10. The result is a solution that works 
effectively with most learning problems (although preparing the data requires 
some effort). Linear models are easy to understand, fast to create, a piece of cake 
to implement from scratch, and, by using some easy tricks, they work even with 
big data problems. If you can master linear and logistic regression, you really 
have the equivalent of a Swiss Army knife for machine learning that can’t do 
everything perfectly but can serve you promptly and with excellent results in 
many occurrences. 


Starting to Combine Variables 
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Regression boasts a long history in different domains: statistics, economics, psy- 
chology, social sciences, and political sciences. Apart from being capable of a large 
range of predictions involving numeric values, binary and multiple classes, prob- 
abilities, and count data, linear regression also helps you to understand group 
differences, model consumer preferences, and quantify the importance of a fea- 
ture in a model. 


Stripped of most of its statistical properties, regression remains a simple, under- 
standable, yet effective algorithm for the prediction of values and classes. Fast to 
train, easy to explain to nontechnical people, and simple to implement in any 
programming language, linear and logistic regression are the first choice of most 
machine learning practitioners when building models to compare with more 
sophisticated solutions. People also use them to figure out the key features in a 
problem, to experiment, and to obtain insight into feature creation. 


Linear regression works by combining through numeric feature summation. Add- 
ing a constant number, called the bias, completes the summation. The bias repre- 
sents the prediction baseline when all the features have values of zero. Bias can 
play an important role in producing default predictions, especially when some of 
your features are missing (and so have a zero value). Here’s the common formula 
for a linear regression: 


y=pX+a 


In this expression, y is the vector of the response values. Possible response vectors 
are the prices of houses in a city or the sales of a product, which is simply any 
answer that is numeric, such as a measure or a quantity. The X symbol states the 
matrix of features to use to guess the y vector. X is a matrix that contains only 
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REMEMBER 


numeric values. The Greek letter alpha (a) represents the bias, which is a con- 
stant, whereas the letter beta (f) is a vector of coefficients that a linear regression 
model uses with the bias to create the prediction. 


Using Greek letters alpha and beta in regression is widespread to the point that 
most practitioners call the vector of coefficients the regression beta. 


You can make sense of this expression in different ways. To make things simpler, 
you can imagine that X is actually composed of a single feature (described as a 
predictor in statistical practice), so you can represent it as a vector named x. When 
only one predictor is available, the calculation is a simple linear regression. Now that 
you have a simpler formulation, your high school algebra and geometry tell you 
that the formulation y=bx+da is a line in a coordinate plane made of an x axis (the 
abscissa) and a y axis (the ordinate). 


When you have more than one feature (a multiple linear regression), you can’t use a 
simple coordinate plane made of x and y anymore. The space now spans multiple 
dimensions, with each dimension being a feature. Now your formula is more 
intricate, composed of multiple x values, each weighted by its own beta. For 
instance, if you have four features (so that the space is four-dimensional), the 
regression formulation, as explicated from matrix form, is 


y =x ,b, + xX»by + x3b3+x,4b,+b 


This complex formula, which exists in a multidimensional space, isn’t a line any- 
more, but rather a plane with as many dimensions as the space. This is a hyperplane, 
and its surface individuates the response values for every possible combination of 
values in the feature dimensions. 


This discussion explains regression in its geometrical meaning, but you can also 
view it as just a large weighted summation. You can decompose the response into 
many parts, each one referring to a feature and contributing to a certain portion. 
The geometric meaning is particularly useful for discussing regression properties, 
but the weighted summation meaning helps you understand practical examples 
better. For instance, if you want to predict a model for advertising expenditures, 
you can use a regression model and create a model like this: 


sales = advertising * bag, + Shops * Ð snop + price * b price +a 


In this formulation, sales are the sum of advertising expenditures, number of 
shops distributing the product, and the product’s price. You can quickly demystify 
linear regression by explaining its components. First, you have the bias, the con- 
stant a, which acts as a starting point. Then you have three feature values, each 
one expressed in a different scale (advertising is a lot of money, price is some 
affordable value, and shops is a positive number), each one rescaled by its respec- 
tive beta coefficient. 
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TIP 


Each beta presents a numeric value that describes the intensity of the relationship 
to the response. It also has a sign that shows the effect of a change in feature. 
When a beta coefficient is near zero, the effect of the feature on the response is 
weak, but if its value is far from zero, either positive or negative, the effect is sig- 
nificant and the feature is important to the regression model. 


To obtain an estimate of the target value, you scale each beta to the measure of the 
feature. A high beta provides more or less effect on the response depending on the 
scale of the feature. A good habit is to standardize the features (by subtracting the 
mean and dividing by standard deviation) to avoid being fooled by high beta val- 
ues on small-scale features and to compare different beta coefficients. The result- 
ing beta values are comparable, allowing you to determine which ones have the 
most impact on the response (those with the largest absolute value). 


If beta is positive, increasing the feature will increase the response, whereas 
decreasing the feature will decrease the response. Conversely, if beta is negative, 
the response will act contrary to the feature: When one is increasing, the other is 
decreasing. Each beta in a regression represents an impact. Using the gradient 
descent algorithm discussed in Chapter 10, linear regression can find the best set 
of beta coefficients (and bias) to minimize a cost function given by the squared 
difference between the predictions and the real values: 


J(w) = LY Xw- yy’ 


This formula tells you the cost J as a function of w, the vector of coefficients of the 
linear model. The cost is the summed squared difference of response values from 
predicted values (the multiplication Xw) divided by two times the number of 
observations (n). The algorithm strives to find the minimum possible solution 
values for the difference between the real target values and the predictions derived 
from the linear regression. 


You can express the result of the optimization graphically as the vertical distances 
between the data points and the regression line. The regression line represents 
the response variable well when the distances are small, as shown in Figure 15-1. 
If you sum the squares of the distances, the sum is always the minimum possible 
when you calculate the regression line correctly. (No other combination of beta 
will result in a lower error.) 


In statistics, practitioners often refer to linear regression as ordinary least squares 
(OLS), indicating a particular way of estimating the solution based on matrix calcu- 
lus. Using this approach isn’t always feasible. It depends on the computation of the 
data matrix inversion, which isn’t always possible. In addition, matrix inverse com- 
putations are quite slow when the input matrix is large. In machine learning, you 
obtain the same results using the radient descent optimization, which handles larger 
amounts of data easier and faster, thus estimating a solution from any input matrix. 
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FIGURE 15-1: 

An example of 
visualizing errors 
of a regression 
line and plane. 
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R versions of linear and logistic regression heavily rely on the statistical approach. 
Python versions, such as those found in the Scikit-learn package, use gradient 
descent. For this reason, the book illustrates examples of linear models using 
Python. 


The only requirement for gradient descent is to standardize (zero mean and unit 
variance) or normalize (feature values bound between +1 and —1) the features 
because the optimization process is sensitive to differently scaled features. Stan- 
dard implementations implicitly standardize (so you don’t have to remember this 
detail). However, if you use sophisticated implementations, scaling can turn into 
a serious issue. As a result, you should always standardize the predictors. Gradient 
descent calculates the solution by optimizing the coefficients slowly and system- 
atically. It bases its update formula on the mathematical derivative of the cost 
function: 


w =w; -a*t (Xw-y)*x, 


The single weight w,, relative to the feature j, is updated by subtracting from it a 
term consisting of a difference divided by the number of examples (n) and a learn- 
ing factor alpha, which determines the impact of the difference in the resulting 
new w) (A small alpha reduces the update effect.) The part of the formula that 
reads (Xw -y) calculates the difference between the prediction from the model 
and the value to predict. By calculating this difference, you tell the algorithm the 
size of the prediction error. The difference is then multiplied by the value of the 
feature j. The multiplication of the error by the feature value enforces a correction 
on the coefficient of the feature proportional to the value of the feature itself. 
Because the features pool into a summation, the formula won’t work with efficacy 
regardless of whether you’re mixing features of different scale — the larger scale 
will dominate the summation. For instance, mixing measures expressed in kilo- 
meters and centimeters isn’t a good idea in gradient descent, unless you trans- 
form them using standardization. 
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TIP 


Math is part of machine learning, and sometimes you can’t avoid learning formu- 
las because they prove useful for understanding how algorithms work. After 
demystifying complex matrix formulas into sets of summations, you can deter- 
mine how algorithms work under the hood and act to ensure that the algorithm 
engine works better by having the right data and correctly set parameters. 


The following Python example uses the Boston dataset from Scikit-learn to try to 
guess the Boston housing prices using a linear regression. The example also tries 
to determine which variables influence the result more. Beyond computational 
issues, standardizing the predictors proves quite useful if you want to determine 
the influential variables. 


from sklearn.datasets import load_boston 
from sklearn.preprocessing import scale 
boston = load_boston( ) 

X, y = scale(boston.data), boston.target 


The regression class in Scikit-learn is part of the 1inear_model module. Because 
you previously scaled the X variables, you don’t need to decide any other prepara- 
tions or special parameters when using this algorithm. 


from sklearn.linear_model import LinearRegression 
regression = LinearRegression() 
regression. fit(X,y) 


Now that the algorithm is fitted, you can use the score method to report the R? 
measure. 


print (regression.score(X,y) ) 
@. 740607742865 


R?, also known as coefficient of determination, is a measure ranging from 0 to 1. 
It shows how using a regression model is better in predicting the response than 
using a simple mean. The coefficient of determination is derived from statistical 
practice and directly relates to the sum of squared errors. You can calculate the R? 
value by hand using the following formula: 


2 dV Xw- y)’ 
RE =1-——— 
OG -y)’ 


The upper part of the division represents the usual difference between response and 
prediction, whereas the lower part represents the difference between the mean of the 
response and the response itself. You square and sum both differences across the 
examples. In terms of code, you can easily translate the formulation into programming 
commands and then compare the result with what the Scikit-learn model reports: 
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import numpy as np 
mean_y = np.mean(y) 
squared_errors_mean = np.sum((y-mean_y) *«*2) 
squared_errors_model = np.sum((y - 

regression. predict(X) )**2) 
R2 = 1- (squared_errors_model / squared_errors_mean) 
print (R2) 


0. 740607742865 


In this case, the R? on the previously fitted data is around 0.74, which, from an 
absolute point of view, is quite good for a linear regression model (values over 
0.90 are very rare and are sometimes even indicators of problems such as data 
snooping or leakage). 


Because R? relates to the sum of squared errors and represents how data points 
can represent a line in a linear regression, it also relates to the statistical correla- 
tion measure. Correlation in statistics is a measure ranging from +1 to —1 that tells 
how two variables relate linearly (that is, if you plot them together, they tell how 
the lines resemble each other). When you square a correlation, you get a propor- 
tion of how much variance two variables share. In the same way, no matter how 
many predictors you have, you can also compute R? as the quantity of information 
explained by the model (the same as the squared correlation), so getting near 1 
means being able to explain most of the data using the model. 


Calculating the R? on the same set of data used for the training is common in sta- 
tistics. In data science and machine learning, you’re always better off to test 
scores on data that is not used for training. Complex algorithms can memorize the 
data rather than learn from it. In certain circumstances, this problem can also 
happen when you’re using simpler models, such as linear regression. 


To understand what drives the estimates in the multiple regression model, you 
have to look at the coefficients_ attribute, which is an array containing the 
regression beta coefficients. By printing the boston.DESCR attribute, you can 
understand the variable reference. 


print ([at':'+str(round(b,1)) for a, b in 
zip(boston. feature_names, regression.coef_, )] ) 


["CRIM:-@.9", ‘ZN:4.4', 'INDUS:@.1', 'CHAS:@.7', 
'NOX:-2.4', "RM:2.7', "AGE:@.0', 'DIS:-3.1', 
"RAD Ot, “TAK 2 4 PIRATION= 2.40. 
'B:0.9', 'LSTAT:-3.7'] 
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The DIS variable, which contains the weighted distances to five employment cen- 
ters, has the largest absolute unit change. In real estate, a house that’s too far 
away from people’s interests (such as work) lowers the value. Instead, AGE or 
INDUS, which are both proportions that describe the building’s age and whether 
nonretail activities are available in the area, don’t influence the result as much; 
the absolute value of their beta coefficients is much lower. 


Mixing Variables of Different Types 
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Quite a few problems arise with the effective, yet simple, linear regression tool. 
Sometimes, depending on the data you use, the problems are greater than the 
benefits of using this tool. The best way to determine whether linear regression 
will work is to use the algorithm and test its efficacy on your data. 


Linear regression can model responses only as quantitative data. When you need 
to model categories as a response, you must turn to logistic regression, covered 
later in the chapter. When working with predictors, you do best by using continu- 
ous numeric variables; although you can fit both ordinal numbers and, with some 
transformations, qualitative categories. 


A qualitative variable might express a color feature, such as the color of a product, 
or a feature that indicates the profession of a person. You have a number of options 
for transforming a qualitative variable by using a technique such as binary encod- 
ing (the most common approach). When making a qualitative variable binary, you 
create as many features as classes in the feature. Each feature contains zero values 
unless its class appears in the data, when it takes the value of one. This procedure 
is called one-hot encoding (as previously mentioned in Chapter 9). Asimple Python 
example using the Scikit-learn preprocessing module shows how to perform one- 
hot encoding: 


from sklearn.preprocessing import OneHotEncoder , 
LabelEncoder 

lbl = LabelEncoder() 
enc = OneHotEncoder( ) 
qualitative = ['red', 'red', 'green', 'blue', 

'red', 'blue', 'blue', 'green' |] 
labels = 1bl.fit_transform(qualitative) .reshape(8,1) 
print(enc. fit_transform( labels) .toarray() ) 
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In statistics, when you want to make a binary variable out of a categorical one, you 
transform all the levels but one because you use the inverse matrix computation 
formula, which has quite a few limitations. In machine learning you use gradient 
descent, so you instead transform all the levels. 


If a data matrix is missing data and you don’t deal with it properly, the model will 
stop working. Consequently, you need to impute the missing values (for instance, 
by replacing a missing value with the mean value calculated from the feature 
itself). Another solution is to use a zero value for the missing case, and to create 
an additional binary variable whose unit values point out missing values in the 
feature. In addition, outliers (values outside the normal range) disrupt linear 
regression because the model tries to minimize the square value of the errors (also 
called residuals). Outliers have large residuals, thus forcing the algorithm to focus 
more on them than on regular points. 


The greatest linear regression limitation is that the model is a summation of inde- 
pendent terms, because each feature stands alone in the summation, multiplied 
only by its own beta. This mathematical form is perfect for expressing a situation 
in which the features are unrelated. For instance, a person’s age and eye color are 
unrelated terms because they do not influence each other. Thus, you can consider 
them independent terms and, in a regression summation, it makes sense that they 
stay separated. On the other hand, a person’s age and hair color aren’t unrelated, 
because aging causes hair to whiten. When you put these features in a regression 
summation, it’s like summing the same information. Because of this limitation, 
you can’t determine how to represent the effect of variable combinations on the 
outcome. In other words, you can’t represent complex situations with your data. 
Because the model is made of simple combinations of weighted features, it 
expresses more bias than variance in its predictions. In fact, after fitting the 
observed outcome values, the solution proposed by linear models is always a pro- 
portionally rescaled mix of features. Unfortunately, you can’t represent some 
relations between a response and a feature faithfully by using this approach. On 
many occasions, the response depends on features in a nonlinear way: Some fea- 
ture values act as hurdles, after which the response suddenly increases or 
decreases, strengthens or weakens, or even reverses. 


As an example, consider how human beings grow in height from childhood. If 


observed in a specific age range, the relationship between age and height is some- 
how linear: The older the child gets, the taller he or she is. However, some children 
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grow more (overall height) and some grow faster (growth in a certain amount of 
time). This observation holds when you expect a linear model to find an average 
answer. However, after a certain age, children stop growing and the height remains 
constant for a long part of life, slowly decreasing in later age. Clearly, a linear 
regression can’t grasp such a nonlinear relationship. (In the end, you can repre- 
sent it as a kind of parabola.) 


Another age-related example is the amount spent for consumer products. People 
in the earliest phases of their life tend to spend less. Expenditures increase during 
the middle of life (maybe because of availability of more income or larger expenses 
because of family obligations), but they decrease again in the latter part life 
(clearly another nonlinear relationship). Observing and thinking more intensely 
of the world around us shows that many nonlinear relationships exist. 


Because the relation between the target and each predictor variable is based on a 
single coefficient, you don’t have a way to represent complex relations like a 
parabola (a unique value of x maximizing or minimizing the response), an expo- 
nential growth, or a more complex nonlinear curve unless you enrich the feature. 
The easiest way to model complex relations is by employing mathematical trans- 
formations of the predictors using polynomial expansion. Polynomial expansion, 
given a certain degree d, creates powers of each feature up to the d-power and 
d-combinations of all the terms. For instance, if you start with a simple linear 
model such as the following Python code: 


y =blxl+b2x2+a 


and then use a polynomial expansion of the second degree, that model becomes 


y =b1x1+ b2x2 + a + b3x1 + *2+b4x2 **2+b5x1x2 


You make the addition to the original formulation (the expansion) using powers 
and combinations of the existing predictors. As the degree of the polynomial 
expansion grows, so does the number of derived terms. The following Python 
example uses the Boston dataset (found in previous examples) to check the tech- 
nique’s effectiveness. If successful, the polynomial expansion will catch nonlinear 
relationships in data and overcome any nonseparability (the problem seen with 
the perceptron algorithm in Chapter 12) at the expense of an increased number of 
predictors. 


from sklearn.preprocessing import PolynomialFeatures 
from sklearn.cross_validation import train_test_split 


pf = PolynomialFeatures(degree=2) 
poly_X = pf.fit_transform(X) 
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X_train, X_test, y_train, y_test = 
train_test_split(poly_X, 
y, test_size=@.33, random_state=42) 


from sklearn.linear_model import Ridge 
reg_regression = Ridge(alpha=@.1, normalize=True) 
reg_regression. fit(X_train, y_train) 
print ('R2: %0.3f' 

% r2_score(y_test,reg_regression.predict(X_test) ) ) 


R2: 0.819 


Because feature scales are enlarged by power expansion, standardizing the data 
after a polynomial expansion is a good practice. 


Polynomial expansion doesn’t always provide the advantages demonstrated by 
the previous example. By expanding the number of features, you reduce the bias 
of the predictions at the expense of increasing their variance. Expanding too much 
may hinder the capability of the model to represent general rules, making it 
unable to perform predictions using new data. This chapter provides evidence of 
this issue and a working solution in the section “Solving overfitting by using 
selection,” which discusses variable selection. 





REMEMBER 


Switching to Probabilities 


Up to now, the chapter has considered only regression models, which express 
numeric values as outputs from data learning. Most problems, however, also 
require classification. The following sections discuss how you can address both 
numeric and classification output. 


Specifying a binary response 


A solution to a problem involving a binary response (the model has to choose from 
between two possible classes) would be to code a response vector as a sequence of 
ones and zeros (or positive and negative values, just as the perceptron does). The 
following Python code proves both the feasibility and limits of using a binary 
response. 


4 = ho.aincey(|@, @, @ @, a, a, a, 11) 
b = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(8,1) 
from sklearn.linear_model import LinearRegression 


Il 


Il 
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FIGURE 15-2: 
Probabilities do 
not work as well 

with a straight 
line as they do 
with a sigmoid 

curve. 


regression = LinearRegression() 
regression. fit(b,a) 
print (regression.predict(b)>@.5) 


[False False False False True True True True] 


In statistics, linear regression can’t solve classification problems because doing so 
would create a series of violated statistical assumptions. So, for statistics, using 
regression models for classification purposes is mainly a theoretical problem, not 
a practical one. In machine learning, the problem with linear regression is that it 
serves as a linear function that’s trying to minimize prediction errors; therefore, 
depending on the slope of the computed line, it may not be able to solve the data 
problem. 


When a linear regression is given the task of predicting two values, such as 0 
and +1, which are representing two classes, it will try to compute a line that pro- 
vides results close to the target values. In some cases, even though the results are 
precise, the output is too far from the target values, which forces the regression 
line to adjust in order to minimize the summed errors. The change results in fewer 
summed deviance errors but more misclassified cases. 


Contrary to the perceptron example in Chapter 12, linear regression doesn’t pro- 
duce acceptable results when the priority is classification accuracy, as shown in 
Figure 15-2. Therefore, it won’t work satisfactorily in many classification tasks. 
Linear regression works best on a continuum of numeric estimates. However, for 
classification tasks, you need a more suitable measure, such as the probability of 
class ownership. 


0.5 











Thanks to the following formula, you can transform linear regression numeric 
estimates into probabilities that are more apt to describe how a class fits an 
observation: 


Zp expr) 
PO =D = TF expr) 
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In this formula, the target is the probability that the response y will correspond to 
the class 1. The letter r is the regression result, the sum of the variables weighted by 
their coefficients. The exponential function, exp(r ), corresponds to Euler’s num- 
ber e elevated to the power of r. A linear regression using this transformation 
formula (also called a link function) for changing its results into probabilities is a 
logistic regression. 


Logistic regression is the same as a linear regression except that the y data contains 
integer numbers indicating the class relative to the observation. So, using the 
Boston dataset from the Scikit-learn datasets module, you can try to guess what 
makes houses in an area overly expensive (median values >= 40): 


from sklearn.linear_model import LogisticRegression 

from sklearn.cross_validation import train_test_split 

binary_y = np.array(y >= 4@).astype(int) 

X_train, X_test, y_train, y_test = train_test_split(X, 

binary_y, test_size=0.33, random_state=5) 

logistic = LogisticRegression() 

logistic. fit(X_train, y_train) 

from sklearn.metrics import accuracy_score 

print('In-sample accuracy: %0.3f' % 
accuracy_score(y_train, logistic.predict(X_train) ) ) 

print('Out-of-sample accuracy: %0.3f' % 
accuracy_score(y_test, logistic.predict(X_test) ) ) 


In-sample accuracy: 90.973 
Out-of-sample accuracy: 9.958 


The example splits the data into training and testing sets, enabling you to check 
the efficacy of the logistic regression model on data that the model hasn’t used for 
learning. The resulting coefficients tell you the probability of a particular class’s 
being in the target class (which is any class encoded using a value of 1). If a coef- 
ficient increases the likelihood, it will have a positive coefficient; otherwise, the 
coefficient is negative. 


for var,coef in zip(boston. feature_names, 
logistic.coef_[@]): 
print (CAS : %7.3f" %(var, coef)) 


Reading the results on your screen, you can see that in Boston, criminality (CRIM) 
has some effect on prices. However, the level of poverty (LSTAT), distance from 
work (DIS), and pollution (NOX) all have much greater effects. Moreover, contrary 
to linear regression, logistic regression doesn’t simply output the resulting class 
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(in this case a 1 or a 0) but also estimates the probability of the observation’s being 
part of one of the two classes. 


print('\nclasses:',logistic.classes_) 
print('\nProbs: \n', logistic.predict_proba(X_test)[:3,:]) 


classes: [@ 1] 


Probs: 
[[ @.39022779 @.60977221 | 
[ @.93856655 @.06143345] 
[ @.98425623 @.01574377] ] 


In this small sample, only the first case has a 61 percent probability of being an 
expensive housing area. When you perform predictions using this approach, you 
also know the probability that your forecast is accurate and act accordingly, 
choosing only predictions with the right level of accuracy. (For instance, you 
might pick only predictions that exceed an 80 percent likelihood.) 


Using probabilities, you can guess a class (the most probable one), but you can 

also order all your predictions with respect to being part of that class. This is 

especially useful for medical purposes, ranking a prediction in terms of likelihood 
TIP with respect to other cases. 


Handling multiple classes 


In a previous problem, K-Nearest Neighbors automatically determined how to 
handle a multiple class problem. (Chapter 14 presents an example showing how to 
guess a single solution from three iris species.) Most algorithms that predict 
probabilities or a score for class automatically handle multiclass problems using 
two different strategies: 


>> One Versus Rest (OvR): The algorithm compares every class with the 
remaining classes, building a model for every class. The class with the highest 
probability is the chosen one. So if a problem has three classes to guess, the 
algorithm also uses three models. This book uses the OneVsRestClassi fier 
class from Scikit-learn to demonstrate this strategy. 


>> One Versus One (OvO): The algorithm compares every class against every 
remaining class, building a number of models equivalent ton x» (n-1) / 2, 
where n is the number of classes. So if a problem has five classes to guess, the 
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algorithm uses ten models. This book uses the OneVsOneC lassi fier class 
from Scikit-learn to demonstrate this strategy. The class that wins the most is 
the chosen one. 


In the case of logistic regression, the default multiclass strategy is the One 
Versus Rest. 


Guessing the Right Features 


Having many features to work with may seem to address the need for machine 
learning to understand a problem fully. However, just having features doesn’t 
solve anything; you need the right features to solve problems. The following sec- 
tions discuss how to ensure that you have the right features when performing 
machine learning tasks. 


Defining the outcome of features 
that don't work together 


As previously mentioned, having many features and making them work together 
may incorrectly indicate that your model is working well when it really isn’t. 
Unless you use cross-validation, error measures such as R? can be misleading 
because the number of features can easily inflate it, even if the feature doesn’t 
contain relevant information. The following example shows what happens to R? 
when you add just random features. 


from sklearn.cross_validation import train_test_split 
from sklearn.metrics import r2_score 
X_train, X_test, y_train, y_test = train_test_split(X, 
y, test_size=0.33, random_state=42 ) 
check = [2«**i for i in range(8)] 
for i in range(2**7+1 ): 
X_train = np.column_stack((X_train,np.random. random( 
X_train.shape[@] ))) 
X_test = np.column_stack((X_test,np.random.random( 
X_test.shape[@] ))) 
regression. fit(X_train, y_train) 
if i in check: 
print ("Random features: %i -> R2: %0.3f" % 
(4, 


r2_score(y_train, regression. predict(X_train) ))) 
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What seems like an increased predictive capability is really just an illusion. You 
can reveal what happened by checking the test set and discovering that the model 
performance has decreased. 


regression. fit(X_train, y_train) 
print ('R2 %0.3f' 
% r2_score(y_test, regression. predict(X_test) ) ) 
# Please notice that the R2 result may change from run to 
# run due to the random nature of the experiment 


R2 0.474 


Solving overfitting by using selection 


Regularization is an effective, fast, and easy solution to implement when you have 
many features and want to reduce the variance of the estimates due to multicol- 
linearity between your predictors. Regularization works by adding a penalty to the 
cost function. The penalization is a summation of the coefficients. If the coeffi- 
cients are squared (so that positive and negative values can’t cancel each other), 
it’s an L2 regularization (also called the Ridge). When you use the coefficient abso- 
lute value, it’s an L1 regularization (also called the Lasso). 


However, regularization doesn’t always work perfectly. L2 regularization keeps all 
the features in the model and balances the contribution of each of them. In an L2 
solution, if two variables correlate well, each one contributes equally to the solu- 
tion for a portion, whereas without regularization, their shared contribution 
would have been unequally distributed. 


Alternatively, Li brings highly correlated features out of the model by making 
their coefficient zero, thus proposing a real selection among features. In fact, set- 
ting the coefficient to zero is just like excluding the feature from the model. When 
multicollinearity is high, the choice of which predictor to set to zero becomes a bit 
random, and you can get various solutions characterized by differently excluded 
features. Such solution instability may prove a nuisance, making the L1 solution 
less than ideal. 


Scholars have found a fix by creating various solutions based on L1 regularization 
and then looking at how the coefficients behave across solutions. In this case, the 
algorithm picks only the stable coefficients (the ones that are seldom set to zero). 
You can read more about this technique on the Scikit-learn website at http: // 
scikit-learn.org/stable/auto_examples/linear_model/ 
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plot_sparse_recovery.html. The following example modifies the polynomial 
expansions example using L2 regularization (Ridge regression) and reduces the 
influence of redundant coefficients created by the expansion procedure: 


from sklearn.preprocessing import PolynomialFeatures 
from sklearn.cross_validation import train_test_sp1lit 


pf = PolynomialFeatures(degree=2 ) 
poly_X = pf.fit_transform(X) 
X_train, X_test, y_train, y_test = 
train_test_split(poly_X, 
y, test_size=@.33, random_state=42) 


from sklearn.linear_model import Ridge 
reg_regression = Ridge(alpha=@.1, normalize=True) 
reg_regression. fit(X_train, y_train) 
print ('R2: %0.3f' 

% r2_score(y_test,reg_regression.predict(X_test) ) ) 


R2: 0.819 


The next example uses L1 regularization. In this case, the example relies on R 
because it provides an effective library for penalized regression called glmnet. You 
can install the required support using the following command: 


install.packages("glmnet") 


Stanford professors and scholars Friedman, Hastie, Tibshirani, and Simon created 
this package. Prof. Trevor Hastie actually maintains the R package. You can find a 
complete vignette explaining the library’s functionality at: http: //web.stanford. 
edu/~hastie/glmnet/glmnet_alpha.html. The example shown in Figure 15-3 
tries to visualize the coefficient path, representing how the coefficient value 
changes according to regularization strength. The lambda parameter decides the 
regularization strength. As before, the following R example relies on the Boston 
dataset, which is obtained from the MASS package. 


data(Boston, package="MASS") 

library(glmnet ) 

X <- as.matrix(scale(Boston[,4:ncol(Boston)-1] )) 

y <- as.numeric(Boston[,ncol(Boston) ] ) 

fit = glmnet(X, y, family="gaussian", alpha=1, 
standardize=FALSE) 

plot(fit, xvar="lambda", label=TRUE, lwd=2) 
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FIGURE 15-3: 
Visualizing 
the path of 
coefficients 

using various 
degrees of L1 
regularization. 
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Log Lambda 


The chart represents all the coefficients by placing their standardized values on 
the ordinate, the vertical axis. As for as the abscissa, the scale uses a log function 
of lambda so that you get an idea of the small value of lambda at the leftmost side 
of the chart (where it’s just like a standard regression). The abscissa also shows 
another scale, placed in the uppermost part of the chart, that tells you how many 
coefficients are nonzero at that lambda value. Running from left to right, you can 
observe how coefficients decrease in absolute value until they become zero, telling 
you the story of how regularization affects a model. Naturally, if you just need to 
estimate a model and the coefficients for the best predictions, you need to find the 
correct lambda value using cross-validation: 


cv <- cv.glmnet(X, y, family="gaussian", 
alpha=1, standardize=FALSE) 
coef(cv, s="lambda.min") 


Learning One Example at a Time 


Finding the right coefficients for a linear model is just a matter of time and mem- 
ory. However, sometimes a system won’t have enough memory to store a huge 
dataset. In this case, you must resort to other means, such as learning from one 
example at a time rather than having all of them loaded into memory. The follow- 
ing sections help you understand the one-example-at-a-time approach to 
learning. 
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Using gradient descent 


The gradient descent finds the right way to minimize the cost function one itera- 
tion at a time. After each step, it accounts for all the model’s summed errors and 
updates the coefficients in order to make the error even smaller during the next 
data iteration. The efficiency of this approach derives from considering all the 
examples in the sample. The drawback of this approach is that you must load all 
the data into memory. 


Unfortunately, you can’t always store all the data in memory because some data- 
sets are huge. In addition, learning using simple learners requires large amounts 
of data in order to build effective models (more data helps to correctly disambigu- 
ate multicollinearity). Getting and storing chunks of data on your hard disk is 
always possible, but it’s not feasible because of the need to perform matrix mul- 
tiplication, which would require lots of data swapping from disk (random disk 
access) to select rows and columns. Scientists who have worked on the problem 
have found an effective solution. Instead of learning from all the data after having 
seen it all (which is called an iteration), the algorithm learns from one example at 
a time, as picked from storage using sequential access, and then goes on to learn 
from the next example. When the algorithm has learned all the examples, it starts 
again from the beginning unless some stopping criterion is met (for instance, 
completing a predefined number of iterations). 


A data stream is the data that flows from disk, one example at a time. Streaming is 
the action of passing data from storage to memory. The out-of-core learning style 
(online learning) occurs when an algorithm learns from a stream, which is a via- 
ble learning strategy discussed at the end of Chapter 10. 


Some common sources of data streams are web traffic, sensors, satellites, and 
surveillance recordings. An intuitive example of a stream is, for instance, the data 
flow produced instant by instant using a sensor, or the tweets generated by a 
Twitter streamline. You can also stream common data that you’re used to keeping 
in memory. For example, if a data matrix is too big, you can treat it as a data 
stream and start learning one row at a time, pulling it from a text file or a data- 
base. This kind of streaming is the online-learning style. 


Understanding how SGD is different 


Stochastic gradient descent (SGD) is a slight variation on the gradient descent 
algorithm. It provides an update procedure for estimating beta coefficients. Linear 
models are perfectly at ease with this approach. 


In SGD, the formulation remains the same as in the standard version of gradient 


descent (called the batch version, in contrast to the online version), except for the 
update. In SGD, the update is executed a single instance at a time, allowing the 
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algorithm to leave core data in storage and place just the single observation needed 
to change the coefficient vector in memory: 


w;=w;-a(Xw—-y)x; 


As with the gradient descent algorithm, the algorithm updates the coefficient, w, of 
feature j by subtracting the difference between the prediction and the real response. 
It then multiplies the difference by the value of the feature j and by a learning factor 
alpha (which can reduce or increase the effect of the update on the coefficient). 


There are other subtle differences when using SGD rather than the gradient 
descent. The most important difference is the stochastic term in the name of this 
online learning algorithm. In fact, SGD expects an example at a time, drawn ran- 
domly from the available examples (random sampling). The problem with online 
learning is that example ordering changes the way the algorithm guesses beta 
coefficients. It’s the greediness principle discussed when dealing with decision 
trees (see Chapter 12): With partial optimization, one example can change the way 
the algorithm reaches the optimum value, creating a different set of coefficients 
than would have happened without that example. 


As a practical example, just keep in mind that SGD can learn the order in which it 
sees the examples. So if the algorithm performs any kind of ordering (historical, 
alphabetical, or, worse, related to the response variable), it will invariably learn it. 
Only random sampling (meaningless ordering) allows you to obtain a reliable 
online model that works effectively on unseen data. When streaming data, you 
need to randomly reorder your data (data shuffling). 


The SGD algorithm, contrary to batch learning, needs a much larger number of itera- 
tions in order to get the right global direction in spite of the contrary indications that 
come from single examples. In fact, the algorithm updates after each new example, 
and the consequent journey toward an optimum set of parameters is more erratic in 
comparison to an optimization made on a batch, which immediately tends to get the 
right direction because it’s derived from data as a whole, as shown in Figure 15-4. 


In this case, the learning rate has even more importance because it dictates how 
the SGD optimization procedure can resist bad examples. In fact, if the learning 
rate is high, an outlying example could derail the algorithm completely, prevent- 
ing it from reaching a good result. On the other hand, high learning rates help to 
keep the algorithm learning from examples. A good strategy is to use a flexible 
learning rate, that is, starting with a flexible learning rate and making it rigid as 
the number of examples it has seen grows. 


If you think about it, as a learning strategy the flexible learning rate is similar to 
the one adopted by our brain: flexible and open to learn when we are children and 
more difficult (but not impossible) to change when we grow older. You use 
Python’s Scikit-learn to perform out-of-core learning because R doesn’t offer an 
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FIGURE 15-4: 
Visualizing the 
different 
optimization 
paths on the 
same data 
problem. 


analogous solution. Python offers two SGD implementations, one for classifica- 
tion problems (SGDClassi fier) and one for regression problems (SGDRegressor ). 
Both of these methods, apart from the fit method (used to fit data in memory), 
feature a partial_fit method that keeps previous results and learning schedule 
in memory, and continues learning as new examples arrive. With the partial_ 
fit method, you can partially fit small chunks of data, or even single examples, 
and continue feeding the algorithm data until it finds a satisfactory result. 
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Both SGD classification and regression implementations in Scikit-learn feature 
different loss functions that you can apply to the stochastic gradient descent opti- 
mization. Only two of those functions refer to the methods dealt with in this 
chapter: 


>> loss='squared_loss': Ordinary least squares (OLS) for linear regression 


>> loss='1log': Classical logistic regression 


The other implementations (hinge, huber, epsilon insensitive) optimize a differ- 
ent kind of loss function similar to the perceptron. Chapter 17 explains them in 
more detail when talking about support vector machines. 


To demonstrate the effectiveness of out-of-core learning, the following example 
sets up a brief experiment in Python using regression and squared_loss as the 
cost function. It demonstrates how beta coefficients change as the algorithm sees 
more examples. The example also passes the same data multiple times to rein- 
force data pattern learning. Using a test set guarantees a fair evaluation, providing 
measures of the capability of the algorithm to generalize to out-of-sample data. 
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You can freely reiterate the experiment after changing the learning parameters by 
modifying learning_rate and its related parameters eta@ and power_t. The 
learning rate determines how each observed example impacts on the optimization 
process, and, apart from the constant option, you can modify it using optimal 
(suitable for classification) and invscaling (for regression). In particular, specific 
parameters, eta@ and power_t, control invscaling according to the formula 


learning rate n = eta0/(t* power_t) 


When the power_t parameter is a number below one, the formula creates a gradual 
decay of the learning rate (t is number of examples seen), allowing it to fix what it 
previously learned and making it resistant to any changes induced by anomalies. 


This Python example uses the Boston dataset after shuffling it and separating it 
into training and testing sets. The example reports the coefficients’ vector and the 
error measures after it sees a variable number of examples in powers of two (in 
order to represent learning at different steps). The experiment shows how long it 
takes before R? increases and the value of coefficients stabilize. 


from sklearn.cross_validation import train_test_sp1lit 

from sklearn.linear_model import SGDRegressor 

X_train, X_test, y_train, y_test = train_test_split(X, 
y, test_size=0.33, random_state=42) 

SGD = SGDRegressor(penalty=None, 
learning_rate='invscaling', 
etaQ@=0.01, power_t=@.25) 


power = 17 
check = [2xxi for i in range(power+1 ) ] 
for i in range(4@0): 
for j in range(X_train.shape[Q] ): 
SGD. partial_fit(X_train[j,:].reshape(1,13), 
y_train|[j].reshape(1, )) 
count = (j+1) + X_train.shape[@] * i 
if count in check: 
R2 = r2_score(y_test,SGD.predict(X_test) ) 
print ('Example %6i R2 %@.3f coef: %s' % 
(count, R2, ' '.join(map(lambda x:'%@.3f' %x, SGD.coef_)))) 


Example 131072 R2 @.724 coef: -1.098 0.891 0.374 0.849 
-1.905 2.752 -0.371 -3.005 2.026 -1.396 -2.011 
1.102 -3.956 


No matter the amount of data, you can always fit a simple but effective linear 
regression model using SGD online learning capabilities. 
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IN THIS CHAPTER 


Upgrading the perceptron to the 
interconnection paradigm 


Structuring neural architectures made 
of nodes and connections 


Getting a glimpse of the 
backpropagation algorithm 


Understanding what deep learning is 
and what it can achieve 


Hitting Complexity 
with Neural Networks 


s you journey in the world of machine learning, you often see metaphors 

from the natural world to explain the details of algorithms. This chapter 

presents a family of learning algorithms that directly derives inspiration 
from how the brain works. They are neural networks, the core algorithms of the 
connectionists’ tribe. 


Starting with the idea of reverse-engineering how a brain processes signals, the 
connectionists base neural networks on biological analogies and their compo- 
nents, using brain terms such as neurons and axons as names. However, you’ll 
discover that neural networks resemble nothing more than a sophisticated kind of 
linear regression when you check their math formulations. Yet, these algorithms 
are extraordinarily effective against complex problems such as image and sound 
recognition, or machine language translation. They also execute quickly when 
predicting. 


Well-devised neural networks use the name deep learning and are behind such 
power tools as Siri and other digital assistants. They are behind the more 
astonishing machine learning applications as well. For instance, you see them 
at work in this incredible demonstration by Microsoft CEO Rick Rashid, who is 
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speaking in English while being simultaneously translated into Chinese: https: // 
www. youtube. com/watch?v=Nu-n1QqFCkg. If an AI revolution is about to happen, 
the increased learning capabilities of neural networks will likely drive it. 


Learning and Imitating from Nature 


FIGURE 16-1: 
Learning logical 
XOR using a 
single separating 
line isn't possible. 


The core neural network algorithm is the neuron (also called a unit). Many neurons 
arranged in an interconnected structure make up a neural network, with each 
neuron linking to the inputs and outputs of other neurons. Thus, a neuron can 
input features from examples or the results of other neurons, depending on its 
location in the neural network. 


Something similar to the neuron, the perceptron, appears earlier in this book, 
although it uses a simpler structure and function. When the psychologist Rosen- 
blatt conceived the perceptron, he thought of it as a simplified mathematical ver- 
sion of a brain neuron. A perceptron takes values as inputs from the nearby 
environment (the dataset), weights them (as brain cells do, based on the strength 
of the in-bound connections), sums all the weighted values, and activates when 
the sum exceeds a threshold. This threshold outputs a value of 1; otherwise, its 
prediction is 0. Unfortunately, a perceptron can’t learn when the classes it tries to 
process aren’t linearly separable. However, scholars discovered that even though 
a single perceptron couldn’t learn the logical operation XOR shown in Figure 16-1 
(the exclusive or, which is true only when the inputs are dissimilar), two percep- 
trons working together could. 





OR function AND function XOR function 


Neurons in a neural network are a further evolution of the perceptron: they take 
many weighted values as inputs, sum them, and provide the summation as the 
result, just as a perceptron does. However, they also provide a more sophisticated 
transformation of the summation, something that the perceptron can’t do. In 
observing nature, scientists noticed that neurons receive signals but don’t always 
release a signal of their own. It depends on the amount of signal received. When a 
neuron acquires enough stimuli, it fires an answer; otherwise it remains silent. In 
a similar fashion, algorithmic neurons, after receiving weighted values, sum them 
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FIGURE 16-2: 
Plots of different 
activation 
functions. 


TIP 


and use an activation function to evaluate the result, which transforms it in a non- 
linear way. For instance, the activation function can release a zero value unless 
the input achieves a certain threshold, or it can dampen or enhance a value by 
nonlinearly rescaling it, thus transmitting a rescaled signal. 


A neural network has different activation functions, as shown in Figure 16-2. The 
linear function doesn’t apply any transformation, and it’s seldom used because it 
reduces a neural network to a regression with polynomial transformations. Neural 
networks commonly use the sigmoid or the hyperbolic tan. 


Binary step Logistic TanH 
10 
10 10 
0.8 0.8 05 
0.6 0.6 
0.0 
0.4 0.4 
0.2 0.2 “05 
0.0 0.0 
-1.0 
-8-6-4-2024 68 -86-4202468 -866-4202468 


The figure shows how an input (expressed on the horizontal axis) can transform 
an output into something else (expressed on the vertical axis). The examples show 
a binary step, a logistic, and a tangent hyperbolic activation function. 


You learn more about activation functions later in the chapter, but note for now 
that activation functions clearly work well in certain ranges of x values. For this 
reason, you should always rescale inputs to a neural network using statistical 
standardization (zero mean and unit variance) or normalize the input in the range 
from 0 to 1 or from -1 to 1. 


Going forth with feed-forward 


In a neural network, you have first to consider the architecture, which is how the 
neural network components are arranged. Contrary to other algorithms, which 
have a fixed pipeline that determines how algorithms receive and process data, 
neural networks require you to decide how information flows by fixing the number 
of units (the neurons) and their distribution in layers, as shown in Figure 16-3. 
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FIGURE 16-3: 


architecture 
of a neural 
network. 


Layer 4 


Layer 3 
Layer1 


Layer 2 


The figure shows a simple neural architecture. Note how the layers filter informa- 
tion in a progressive way. This is a feed-forward input because data feeds one way 
forward into the network. Connections exclusively link the units in one layer with 
the units in the following layer (information flow from left to right). No connec- 
tions exist between units in the same layer or with units outside the next layer. 
Moreover, the information pushes forward (from the left to the right). Processed 
data never returns to previous neuron layers. 


Using a neural network is like using stratified filtering system for water: You pour 
the water from above and the water is filtered at the bottom. The water has no way 
to go back; it just goes forward and straight down, and never laterally. In the same 
way, neural networks force data features to flow through the network and mix 
with each other only according to the network’s architecture. By using the best 
architecture to mix features, the neural network creates new composed features at 
every layer and helps achieve better predictions. Unfortunately, there is no way to 
determine the best architecture without empirically trying different solutions and 
testing whether output data helps predict your target values after flowing through 
the network. 


The first and last layers play an important role. The first layer, called the input 
layer, picks up the features from each data example processed by the network. The 
last layer, called the output layer, releases the results. 


A neural network can process only numeric, continuous information; it can’t be 
constrained to work with qualitative variables (for example, labels indicating a 
quality such as red, blue, or green in an image). You can process qualitative vari- 
ables by transforming them into a continuous numeric value, such as a series of 
binary values, as discussed in Chapter 9 in the material about working with data. 
When a neural network processes a binary variable, the neuron treats the variable 
as a generic number and turns the binary values into other values, even negative 
ones, by processing across units. 
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Note the limitation of dealing only with numeric values, because you can’t expect 
the last layer to output a nonnumeric label prediction. When dealing with a regres- 
sion problem, the last layer is a single unit. Likewise, when you’re working with a 
classification and you have output that must choose from a number n of classes, 
you should have n terminal units, each one representing a score linked to the 
probability of the represented class. Therefore, when classifying a multiclass 
problem such as iris species (as in the Iris dataset demonstration found in 
Chapter 14), the final layer has as many units as species. For instance, in the clas- 
sical iris classification example, created by the famous statistician Fisher, you 
have three classes: setosa, versicolor, and virginica. In a neural network 
based on the Iris dataset, you therefore have three units representing one of the 
three iris species. For each example, the predicted class is the one that gets the 
higher score at the end. 


In some neural networks, there are special final layers, called a softmax, which 
can adjust the probability of each class based on the values received from a previ- 
ous layer. 


In classification, the final layer may represent both a partition of probabilities 
thanks to softmax (a multiclass problem in which total probabilities sum to 
100 percent) or an independent score prediction (because an example can have 
more classes, which is a multilabel problem in which summed probabilities can be 
more than 100 percent). When the classification problem is a binary classification, 
a single node suffices. Also, in regression, you can have multiple output units, 
each one representing a different regression problem (for instance, in forecasting, 
you can have different predictions for the next day, week, month, and so on). 


Going even deeper down the rabbit hole 


Neural networks have different layers, each one having its own weights. Because 
the neural network segregates computations by layers, knowing the reference 
layer is important because it means accounting for certain units and connections. 
Thus you can refer to every layer using a specific number and generically talk 
about each layer using the letter 1. 


Each layer can have a different number of units, and the number of units located 
between two layers dictates the number of connections. By multiplying the num- 
ber of units in the starting layer with the number in the following layer, you can 
determine the total number of connections between the two: number of connections“ = 
units) * units, 


A matrix of weights, usually named with the uppercase Greek letter theta (O), 


represents the connections. For ease of reading, the book uses the capital letter W, 
which is a fine choice because it is a matrix. Thus, you can use W! to refer to the 
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connection weights from layer 1 to layer 2, W? for the connections from layer 2 to 
layer 3, and so on. 


You may see references to the layers between the input and the output as hidden 
layers and count layers starting from the first hidden layer. This is just a different 
convention from the one used in the book. The examples in the book always start 
counting from the input layer, so the first hidden layer is layer number 2. 


Weights represent the strength of the connection between neurons in the net- 
work. When the weight of the connection between two layers is small, it means 
that the network dumps values flowing between them and signals that taking this 
route won’t likely influence the final prediction. On the contrary, a large positive 
or negative value affects the values that the next layer receives, thus determining 
certain predictions. This approach is clearly analogous to brain cells, which don’t 
stand alone but are in connection with other cells. As someone grows in experi- 
ence, connections between neurons tend to weaken or strengthen to active or 
deactivate certain brain network cell regions, causing other processing or an 
activity (a reaction to a danger, for instance, if the processed information signals 
a life-threatening situation). 


Now that you know some conventions regarding layers, units, and connections, 
you can start examining the operations that neural networks execute in detail. 
First, you can call inputs and outputs in different ways: 


>> a: The result stored in a unit in the neural network after being processed by 
the activation function (called g). This is the final output that is sent further 
along the network. 


>> z: The multiplication between a and the weights from the W matrix. z repre- 
sents the signal going through the connections, analogous to water in pipes 
that flows at a higher or lower pressure depending on the pipe thickness. In 
the same way, the values received from the previous layer get higher or lower 
values because of the connection weights used to transmit them. 


Each successive layer of units in a neural network progressively processes the 
values taken from the features, same as in a conveyor belt. As data transmits in 
the network, it arrives into each unit as a value produced by the summation of the 
values present in the previous layer and weighted by connections represented in 
the matrix W. When the data with added bias exceeds a certain threshold, the 
activation function increases the value stored in the unit; otherwise, it extin- 
guishes the signal by reducing it. After processing by the activation function, the 
result is ready to push forward to the connection linked to the next layer. These 
steps repeat for each layer until the values reach the end and you have a result, as 
shown in Figure 16-4. 
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FIGURE 16-4: 

A detail of the 
feed-forward 
process ina 
neural network. 





The figure shows a detail of the process that involves two units pushing their 
results to another unit. This event happens in every part of the network. When you 
understand the passage from two neurons to one, you can understand the entire 
feed-forward process, even when more layers and neurons are involved. For more 
explanation, here are the seven steps used to produce a prediction in a neural 
network made of four layers (like the one shown earlier in the chapter in 
Figure 16-3): 


1. 


The first layer (notice the superscript 1 on a) loads the value of each feature in 
a different unit: 


a= X 


The weights of the connections bridging the input layer with the second layer 
are multiplied by the values of the units in the first layer. A matrix multiplica- 
tion weights and sums the inputs for the second layer together. 


zwa) 


The algorithm adds a bias constant to layer two before running the activation 
function. The activation function transforms the second-layer inputs. The 
resulting values are ready to pass to the connections. 


al) = g(z® + bias’) 
The third-layer connections weigh and sum the outputs of layer two. 
zO = wl2)a(2) 


The algorithm adds a bias constant to layer three before running the activation 
function. The activation function transforms the layer-three inputs. 


a® = g(z® + bias) 


The layer-three outputs are weighted and summed by the connections to the 
output layer. 


24) = wal) 
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7. Finally, the algorithm adds a bias constant to layer four before running the 
activation function. The output units receive their inputs and transform the 
input using the activation function. After this final transformation, the output 
units are ready to release the resulting predictions of the neural network. 


a™ = g(z\) + bias“) 


The activation function plays the role of a signal filter, helping to select the rele- 
vant signals and avoid the weak and noisy ones (because it discards values below 
a certain threshold). Activation functions also provide nonlinearity to the output 
because they enhance or damp the values passing through them in a nonpropor- 
tional way. 


The weights of the connections provide a way to mix and compose the features in 
a new way, creating new features in a way not too different from a polynomial 
expansion. The activation renders nonlinear the resulting recombination of the 
features by the connections. Both of these neural network components enable the 
algorithm to learn complex target functions that represent the relationship 
between the input features and the target outcome. 


Getting Back with Backpropagation 


From an architectural perspective, a neural network does a great job of mixing 
signals from examples and turning them into new features to achieve an approxi- 
mation of complex nonlinear functions (functions that you can’t represent as a 
straight line in the features’ space). To create this capability, neural networks 
work as universal approximators (for more details, go to https: //en.wikipedia. 
org/wiki/Universal_approximation_theorem), which means that they can 
guess any target function. However, you have to consider that one aspect of this 
feature is the capacity to model complex functions (representation capability), and 
another aspect is the capability to learn from data effectively. Learning occurs in 
a brain because of the formation and modification of synapses between neurons, 
based on stimuli received by trial-and-error experience. Neural networks provide 
a way to replicate this process as a mathematical formulation called 
backpropagation. 


Since its early appearance in the 1970s, the backpropagation algorithm has been 
given many fixes. Each neural network learning process improvement resulted in 
new applications and a renewed interest in the technique. In addition, the current 
deep learning revolution, a revival of neural networks, which were abandoned at 
the beginning of the 1990s, is due to key advances in the way neural networks 
learn from their errors. As seen in other algorithms, the cost function activates 
the necessity to learn certain examples better (large errors correspond to high 
costs). When an example with a large error occurs, the cost function outputs a 
high value that is minimized by changing the parameters in the algorithm. 
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In linear regression, finding an update rule to apply to each parameter (the vector 
of beta coefficients) is straightforward. However, in a neural network, things are 
a bit more complicated. The architecture is variable and the parameter coefficients 
(the connections) relate to each other because the connections in a layer depend 
on how the connections in the previous layers recombined the inputs. The solu- 
tion to this problem is the backpropagation algorithm. Backpropagation is a smart 
way to propagate the errors back into the network and make each connection 
adjust its weights accordingly. If you initially feed-forward propagated informa- 
tion to the network, it’s time to go backward and give feedback on what went 
wrong in the forward phase. 


Discovering how backpropagation works isn’t complicated, even though demon- 
strating how it works using formulas and mathematics requires derivatives and 
the proving of some formulations, which is quite tricky and beyond the scope of 
this book. To get a sense of how backpropagation operates, start from the end of 
the network, just at the moment when an example has been processed and you 
have a prediction as an output. At this point, you can compare it with the real 
result and, by subtracting the two results, get an offset, which is the error. Now 
that you know the mismatch of the results at the output layer, you can progress 
backward in order to distribute it along all the units in the network. 


The cost function of a neural network for classification is based on cross-entropy 
(as seen in logistic regression): 


Cost =y *log(hw(X))+(1-y)*log(1—hy (X)) 


This is a formulation involving logarithms. It refers to the prediction produced by 
the neural network and expressed as h,,(X) (which reads as the result of the net- 
work given connections W and X as input). To make things easier, when thinking 
of the cost, it helps to simply think of the formulation as computing the offset 
between the expected results and the neural network output. 


The first step in transmitting the error back into the network relies on backward 
multiplication. Because the values fed to the output layer are made of the contri- 
butions of all units, proportional to the weight of their connections, you can 
redistribute the error according to each contribution. For instance, the vector of 
errors of a layer n in the network, a vector indicated by the Greek letter delta (ô), 
is the result of the following formulation: 


5M wt g gard 


This formula says that, starting from the final delta, you can continue redistribut- 
ing delta going backward in the network and using the weights you used to push 
forward the value to partition the error to the different units. In this way, you can 
get the terminal error redistributed to each neural unit, and you can use it to 
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recalculate a more appropriate weight for each network connection to minimize 
the error. To update the weights W of layer 1, you just apply the following 
formula: 


WO =W 4780 + g(z” ) za® 


It may appear to be a puzzling formula at first sight, but it is a summation, and 
you can discover how it works by looking at its elements. First, look at the func- 
tion g’. It’s the first derivative of the activation function g, evaluated by the input 
values z. This book discusses using derivatives in Chapter 10. In fact, this is the 
gradient descent method. Gradient descent determines how to reduce the error 
measure by finding, among the possible combinations of values, the weights that 
most reduce the error. 


The Greek letter eta (n), sometimes also called alpha (aœ) or epsilon (e) depending 
on the textbook you consult, is the learning rate. As found in other algorithms, it 
reduces the effect of the update suggested by the gradient descent derivative. In 
fact, the direction provided may be only partially correct or just roughly correct. 
By taking multiple small steps in the descent, the algorithm can take a more pre- 
cise direction toward the global minimum error, which is the target you want to 
achieve (that is, a neural network producing the least possible prediction error). 


Different methods are available for setting the right eta value, because the opti- 
mization largely depends on it. One method sets the eta value starting high and 
reduces it during the optimization process. Another method variably increases or 
decreases eta based on the improvements obtained by the algorithm: Large 
improvements call a larger eta (because the descent is easy and straight); smaller 
improvements call a smaller eta so that the optimization will move slower, look- 
ing for the best opportunities to descend. Think of it as being on a tortuous path 
in the mountains: You slow down and try not to be struck or thrown off the road 
as you descend. 


Most implementations offer an automatic setting of the correct eta. You need to 
note this setting’s relevance when training a neural network because it’s one of 
the important parameters to tweak to obtain better predictions, together with the 
layer architecture. 


Weight updates can happen in different ways with respect to the training set of 
examples: 


>> Online mode: The weight update happens after every example traverses the 
network. In this way, the algorithm treats the learning examples as a stream 
from which to learn in real time. This mode is perfect when you have to 
learn out-of-core, that is, when the training set can't fit into RAM memory. 
However, this method is sensitive to outliers, so you have to keep your 
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learning rate low. (Consequently, the algorithm is slow to converge to a 
solution.) 


>> Batch mode: The weight update happens after seeing all the examples in the 
training set. This technique makes optimization fast and less subject to having 
variance appear in the example stream. In batch mode, the backpropagation 
considers the summed gradients of all examples. 


>> Mini-batch (or stochastic) mode: The weight update happens after the 
network has processed a subsample of randomly selected training set 
examples. This approach mixes the advantages of online mode (low memory 
usage) and batch mode (a rapid convergence), while introducing a random 
element (the subsampling) to avoid having the gradient descent stuck ina 
local minima. 


Struggling with Overfitting 


Given the neural network architecture, you can imagine how easily the algorithm 
could learn almost anything from data, especially if you added too many layers. In 
fact, the algorithm does so well that its predictions are often affected by a high 
estimate variance called overfitting. Overfitting causes the neural network to learn 
every detail of the training examples, which makes it possible to replicate them in 
the prediction phase. But apart from the training set, it won’t ever correctly pre- 
dict anything different. The following sections discuss some of the issues with 
overfitting in more detail. 


Understanding the problem 


When you use a neural network for a real problem, you have to take some caution- 
ary steps in a much stricter way than you do with other algorithms. Neural net- 
works are frailer and more prone to relevant errors than other machine learning 
solutions. 


First, you carefully split your data into training, validation, and test sets. Before 
the algorithm learns from data, you must evaluate the goodness of your parame- 
ters: architecture (the number of layers and nodes in them); activation functions; 
learning parameter; and number of iterations. In particular, the architecture 
offers great opportunities to create powerful predictive models at a high risk of 
overfitting. The learning parameter controls how fast a network learns from data, 
but it may not suffice in preventing overfitting the training data. 


CHAPTER 16 Hitting Complexity with Neural Networks 289 


290 


TIP 


You have two possible solutions to this problem. The first is regularization, as in 
linear and logistic regression. You can sum all connection coefficients, squared or 
in absolute value, to penalize models with too many coefficients with high values 
(achieved by L2 regularization) or with values different from zero (achieved by L1 
regularization). The second solution is also effective because it controls when 
overfitting happens. It’s called early-stop and works by checking the cost function 
on the validation set as the algorithm learns from the training set. 


You may not realize when your model starts overfitting. The cost function calcu- 
lated using the training set keeps improving as optimization progresses. However, 
as soon as you Start recording noise from the data and stop learning general rules, 
you can check the cost function on an out-of-sample (the validation sample). At 
some point, you’ll notice that it stops improving and starts worsening, which 
means that your model has reached its learning limit. 


Opening the black box 


The best way to learn how to build a neural network is to build one. Python offers 
a wealth of possible implementations for neural networks and deep learning. 
Python has libraries such as Theano (http://deeplearning.net/software/ 
theano/), which allows complex computations at an abstract level, and more 
practical packages, such as Lasagne (https: //github.com/Lasagne/Lasagne), 
which allows you to build neural networks, though it still requires some abstrac- 
tions. For this reason, you need wrappers, such as Nolearn, which is compatible 
with Scikit-learn (https: //github.com/dnouri/nolearn), or Keras (https: // 
github.com/fchollet/keras), which can also wrap the TensorFlow (https: // 
github .com/tensor flow/tensor flow) library released by Google that has the 
potential to replace Theano as a software library for neural computation. 


R provides libraries that are less complicated and more accessible, such as 
nnet (https: //cran.r—-project .org/web/packages/nnet/), AMORE (https: // 
cran.r-project.org/web/packages/AMORE/), and neuralnet (https: //cran.r- 
project.org/web/packages/neuralnet/). These brief examples in R show how 
to train both a classification network (on the Iris dataset) and a regression net- 
work (on the Boston dataset). Starting from classification, the following code 
loads the dataset and splits it into training and test sets: 


library(MASS) 
library("neuralnet" ) 


target <- model.matrix( ~ Species - 1, data=iris ) 
colnames(target) <- c("setosa", "versicolor", "virginica" ) 
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set .seed(101 ) 
index <- sample(1:nrow(iris), 120) 


train_predictors <- iris[index, 1:4] 
test_predictors <- iris[-index, 1:4] 


Because neural networks rely on gradient descent, you need to standardize or 
normalize the inputs. Normalizing is better so that the minimum is zero and the 
maximum is one for every feature. Naturally, you learn how to make the numeric 
conversion using the training set only in order to avoid any chance of using infor- 
mation from the test out-of-sample. 


min_vector <- apply(train_predictors, 2, min) 
range_vector <- apply(train_predictors, 2, max) - 
apply(train_predictors, 2, min) 


train_scaled <- cbind(scale(train_predictors, 

min_vector, range_vector), 

target [ index, ] ) 

test_scaled <- cbind(scale(test_predictors, 
min_vector, range_vector), 
target [-index, ] ) 





summary (train_scaled) 


When the training set is ready, you can train the model to guess three binary vari- 
ables, with each one representing a class. The output is a value for each class 
proportional to its probability of being the real class. You pick a prediction by tak- 
ing the highest value. You can also visualize the network by using the internal plot 
and thus seeing the neural network architecture and the assigned weights, as 
shown in Figure 16-5. 


set.seed(102) 

nn_iris <- neuralnet(setosa + versicolor + virginica ~ 
Sepal.Length + Sepal.Width 
+ Petal.Length + Petal .Width, 
data=train_scaled, hidden=c(2), 
linear .output=F ) 


plot(nn_iris) 
predictions <- compute(nn_iris, test_scaled[,1:4] ) 


y_predicted <- apply(predictions$net.result,1,which.max) 
y_true <- apply(test_scaled[,5:7],1,which.max) 
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FIGURE 16-5: 


You can plot a 
trained neural 
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network. 


confusion_matrix <- table(y_true, y_predicted) 

accuracy <- sum(diag(confusion_matrix)) / 
sum(confusion_matrix) 

print (confusion_matrix) 

print (paste("Accuracy:",accuracy ) ) 
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Error: 0.899967 Steps: 4076 


The following example demonstrates how to predict house values in Boston, using 
the Boston dataset. The procedure is the same as in the previous classification, but 
here you have a single output unit. The code also plots the test set’s predicted 
results against the real values to verify the good fit of the model. 


no_examples <- nrow(Boston) 
features <- colnames(Boston) 


set.seed(101 ) 
index <- sample(1:no_examples, 400) 


train <- Boston[index, ] 
test <- Boston[-index, ] 


min_vector <- apply(train,2,min) 

range_vector <- apply(train,2,max) - apply(train,2,min) 

scaled_train <- scale(train,min_vector ,range_vector ) 

scaled_test <- scale(test, min_vector,range_vector ) 

', paste(features[1:13], 
collapse='+')) 


formula = paste("medv ~' 
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nn_boston <- neuralnet(formula, data=scaled_train, 
hidden=c(5,3), linear.output=T) 

predictions <- compute(nn_boston, scaled_test[,1:13] ) 

predicted_values <- (predictions$net.result * 
range_vector[14]) + min_vector[14] 


RMSE <- sqrt(mean((test[,14] - predicted_values)%2)) 
print (paste("RMSE:",RMSE)) 
plot(test[,14],predicted_values, cex=1.5) 
abline(Q,1,1lwd=1 ) 


Introducing Deep Learning 


After backpropagation, the next improvement in neural networks led to deep 
learning. Research continued in spite of AI winter and neural networks started to 
take advantage of the developments in CPUs and GPUs (the graphic processing 
units better known for their application in gaming but which are actually powerful 
computing units for matrix and vector calculations). These technologies make 
training neural networks an achievable task in a shorter time and accessible to 
more people. Research also opened a world of new applications. Neural networks 
can learn from huge amounts of data, and because they’re more prone to high 
variance than to bias, they can take advantage of big data, creating models that 
continuously perform better, depending on the amounts of data you feed them. 
However, you need large, complex networks for certain applications (to learn 
complex features, such as the characteristics of a series of images) and thus incur 
problems like the vanishing gradient. 


In fact, when training a large network, the error redistributes among the neurons 
favoring the layers nearest to the output layer. Layers that are further away receive 
smaller errors, sometimes too small, making training slow if not impossible. 
Thanks to the studies of scholars such as Geoffrey Hinton, new turnarounds help 
avoid the problem of the vanishing gradient. The result definitely helps a larger 
network, but deep learning isn’t just about neural networks with more layers and 
units. 


In addition, something inherently qualitative changed in deep learning as com- 
pared to shallow neural networks, shifting the paradigm in machine learning 
from feature creation (features that make learning easier) to feature learning 
(complex features automatically created on the basis of the actual features). Big 
players such as Google, Facebook, Microsoft, and IBM spotted the new trend and 
since 2012 have started acquiring companies and hiring experts (Hinton now 
works with Google; LeCun leads Facebook AI research) in the new fields of deep 
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learning. The Google Brain project, run by Andrew Ng and Jeff Dean, put together 
16,000 computers to calculate a deep learning network with more than a billion 
weights, thus enabling unsupervised learning from YouTube videos. 


There is a reason why the quality of deep learning is different. Of course, part of 
the difference is the increased usage of GPUs. Together with parallelism (more 
computers put in clusters and operating in parallel), GPUs allow you to success- 
fully apply pretraining, new activation functions, convolutional networks, and 
drop-out, a special kind of regularization different from L1 and L2. In fact, it has 
been estimated that a GPU can perform certain operations 70 times faster than 
any CPU, allowing a cut in training times for neural networks from weeks to days 
or even hours (for reference see http://www.machinelearning.org/archive/ 
icm12009/papers/218.pdf). 


Both pretraining and new activation functions help solve the problem of the 
vanishing gradient. New activation functions offer better derivative functions, and 
pretraining helps start a neural network with better initial weights that require just 
a few adjustments in the latter parts of the network. Advanced pretraining tech- 
niques such as Restricted Boltzanman Machines (https://en.wikipedia.org/ 
wiki/Restricted_Boltzmann_machine), Autoencoders (https://en.wikipedia. 
org/wiki/Autoencoder), and Deep Belief Networks (https://en.wikipedia.org/ 
wiki/Deep_belief_network) elaborate data in an unsupervised fashion by estab- 
lishing initial weights that don’t change much during the training phase of a deep 
learning network. Moreover, they can produce better features representing the data 
and thus achieve better predictions. 


Given the high reliance on neural networks for image recognition tasks, deep 
learning has achieved great momentum thanks to a certain type of neural net- 
work, the convolutional neural networks. Discovered in the 1980s, such networks 
now bring about astonishing results because of the many deep learning additions 
(for reference, see http://rodrigob.github.io/are_we_there_yet/build/ 
classification_datasets_results.html). 


To understand the idea behind convolutional neural networks, think about the 
convolutions as filters that, when applied to a matrix, transform certain parts of 
the matrix, make other parts disappear, and make other parts stand out. You can 
use convolution filters for borders or for specific shapes. Such filters are also use- 
ful for finding details in images that determine what the image shows. Humans 
know that a car is a car because it has a certain shape and certain features, not 
because they have previously seen every type of cars possible. A standard neural 
network is tied to its input, and if the input is a pixel matrix, it recognizes shapes 
and features based on their position on the matrix. Convolution neural networks 
can elaborate images better than a standard neural network because 
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>> The network specializes particular neurons to recognize certain shapes 
(thanks to convolutions), so that same capability to recognize a shape doesn't 
need to appear in different parts of the network. 


>» By sampling parts of an image into a single value (a task called pooling), you 
don't need to strictly tie shapes to a certain position (which would make it 
impossible to rotate them). The neural network can recognize the shape in 
every rotation or distortion, thus assuring a high capacity of generalization of 
the convolutional network. 


Finally, drop-out is a new type of regularization that is particularly effective with 
deep convolutional networks, but it also works with all deep learning architec- 
tures, which acts by temporarily and randomly removing connections between the 
neurons. This approach removes connections that collect only noise from data 
during training. Also, this approach helps the network learn to rely on critical 
information coming from different units, thus increasing the strength of the cor- 
rect signals passed along the layers. 
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IN THIS CHAPTER 


Revisiting the nonlinear separability 
problem 


Getting into the basic formulation of a 
linear SVM 


Grasping what kernels are and what 
they can achieve 


Discovering how R and Python 
implementations of SVMs work 


Going a Step beyond 
Using Support Vector 
Machines 


ometimes ideas sprout from serendipity, and sometimes from the urgent 

necessity to solve relevant problems. Knowing the next big thing in machine 

learning isn’t possible yet because the technology continues to evolve. 
However, you can discover the tools available that will help machine learning 
capabilities grow, allowing you to solve more problems and feed new intelligent 
applications. 


Support vector machines (SVMs) were the next big thing two decades ago, and 
when they appeared, they initially left many scholars wondering whether they’d 
really work. Many questioned whether the kind of representation SVMs were 
capable of could perform useful tasks. The representation is the capability of 
machine learning to approximate certain kinds of target functions expected at the 
foundations of a data problem. Having a good representation of a problem implies 
being able to produce reliable predictions on any new data. 
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SVMs not only demonstrate an incredible capability of representation but also a 
useful one, allowing the algorithm to find its niche applications in the large (and 
growing) world of machine learning applications. You use SVMs to drive the algo- 
rithm, not the other way around. This chapter helps you discover this fact math- 
ematically by using a mind-blowing algorithm that, through complex calculations, 
can solve important problems in image recognition, medical diagnosis, and tex- 
tual classification. 


Revisiting the Separation Problem: 
A New Approach 
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As discussed in Chapter 12, when talking about the perceptron, nonseparability of 
classes can be a problem when you try to classify the examples of two classes in 
an exact manner: There is no straight line that traces a precise border between 
different examples. The set of machine learning algorithms that you choose offers 
some options in such occurrences: 


>> K-Nearest Neighbors: Adapts to nonlinear boundaries between classes when 
using a small k. 


>> Logistic regression: Solves the problem by estimating a probability of being 
in a particular class, thus allowing an estimate even if distinguishing between 
classes isn’t possible due to partial overlap. 


>> Transforming the features: Solves the problem by employing both feature 
creation (adding human creativity and knowledge to the learning process) and 
automatic polynomial expansion (creating power transformations and 
interactions) to find a new set of features used to distinguish classes by 
means of a straight line. 


Decision trees (discussed in Chapter 18) naturally don’t fear any nonlinearity 
because the algorithm builds their classification boundaries using multiple rules 
that can easily approximate complex curves. The same is true of neural networks, 
which naturally create feature transformations and nonlinear approximations by 
making different strata of neurons connect (at the price of high estimate variability 
when the training process isn’t handled carefully). Given an already large range of 
possible options, you may wonder why it was necessary to create another type of 
machine learning algorithm, such as SVM, to solve the nonseparability problem. 


The lack of success that people have had in searching for a master algorithm that 
learns most problems means that you do have to believe in the no-free-lunch 
theorem and recognize that you can’t deem one algorithm better than the other. 
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The kind of features you’re dealing with and the kind of problem you need to solve 
are what determine which algorithms work better. Having access to one more 
effective learning technique is like getting an extra weapon to fight difficult data 
problems. 


Moreover, SVMs feature a range of characteristics that can make SVMs quite 
appealing for many data problems: 


>> A comprehensive family of techniques for binary and multiclass classification, 
regression, and detection of anomalous or novelty data 


>» Robust handling of overfitting, noisy data, and outliers 


>> A capability to handle situations with many variables (and SVMs are still 
effective when you have more variables than examples) 


>» Easy and timely handling of up to about 10,000 training examples 


>> Automatic detection of nonlinearity in data, so that applying transformations 
directly to variables and feature creation isn’t necessary 


In particular, the last characteristic is effective thanks to the availability of special 
functions, the kernel functions. The special capability of kernel functions is to 
map the original feature space into a new feature space reconstructed to achieve 
better classification of regression results. It’s similar to the polynomial expansion 
principle (one available kernel function provides polynomial expansion support), 
but the mathematics behind it require fewer computations, allowing the algo- 
rithm to map complex response functions in less time and with more precision. 


Explaining the Algorithm 


SVMs, a kind of algorithm used with Random Forests and Gradient Boosting 
Machines, are the creation of the mathematician Vladimir Vapnik and of some of 
his colleagues working at AT&T laboratories in the 1990s (such as Boser, Guyon, 
and Cortes). Even though many machine learning experts were initially skeptical 
of the algorithm because it didn’t resemble any other then-existing approach, 
SVMs quickly gained momentum and success thanks to their performance in 
many image-recognition problems, such as handwritten input, that were chal- 
lenging the machine learning community of the time. 


Today, SVMs see widespread use among data scientists, who apply it to an incred- 
ible array of problems, from medical diagnosis to image recognition and textual 
classification. The technique is somewhat limited in its applicability to big data 
because of a lack of scalability when examples and features are too numerous. 
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FIGURE 17-1: 


Comparing the 
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different 
approaches. 


The mathematics are a bit complex, but the idea that started everything is quite 
simple. With this in mind, the chapter uses easy examples and demonstrations to 
demonstrate the basic intuitions and mathematical knowledge. 


We start by looking at the problem of separating two groups with a line. You rarely 
see a similar situation in real-life data, but it’s the basic classification problem in 
machine learning, and many algorithms, such as the perceptron, are built from it. 
Without transforming the feature space of a hypothetical data matrix (made of 
two features, x1 and x2), Figure 17-1 shows how to solve the problem using (clock- 
wise from upper left) a perceptron, a logistic regression, and an SVM. 





re) Perceptron Oo Logistic Regression 
(e) 


























Interestingly, although the perceptron aims only at separating the classes, and 
the mass of the points clearly influences the logistic regression, the separating 
line drawn by an SVM has clear and defined characteristics. By observing the line 
differences, you can see that the best line separating the classes is the one with 
the largest space between the line itself and the examples on the fringes of the 
classes, which is the solution from SVM. Using SVM terminology, the separating 
line is the one with the largest margin (the empty space between the boundaries of 
the classes). SVM places the separating line in the middle of the margin (described 
in textbooks as maximum margin or optimal margin hyperplane) and the examples 
near the line (thus being part of the boundaries) are support vectors. 
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REMEMBER 


Support vectors are actually examples represented as points in the feature space. 
In fact, in order to define the position of a point in a space, you have to state its 
coordinates in each dimension as a series of numbers, which is a vector. Moreover, 
you can view them as support vectors because the largest margin hyperplane 
depends on them — and changing any of them mutates the margin and the hyper- 
plane defining it. 


Margin, separating hyperplane, and support vectors are key terms used to define 
how an SVM works. Support vectors decide the best margin possible, both in the 
case of linearly separable and nonseparable classes. 


The strategy of an SVM is quite simple: By looking for the largest separating mar- 
gin, it takes into account that the algorithm derives a classification function from 
a sample of observations, but that it can’t rely on those observations. Sampled 
values do change, sometimes even greatly from sample to sample. Consequently, 
you can’t rely on a just fit approach based on a single sample to understand all 
possible data variations (the way a perceptron does, for example). Because you 
can’t know whether a successive sample will be similar to the one used for learn- 
ing, keeping the fattest margin allows SVM flexibility in the feature space when 
working with successive samples. In addition, by looking at the class boundaries, 
SVM isn’t influenced by distant points (contrary to a linear regression). The algo- 
rithm determines the margin using only the examples placed on the boundaries. 


Getting into the math of an SVM 


In terms of a mathematical formulation, you can define an SVM by starting with a 
perceptron formulation and enforcing certain constraints, that is, trying to obtain 
a separating line according to certain rules we are going to describe. Because the 
algorithm divides classes in a feature space using a separating line (or hyperplane, 
when you have more than two dimension features), ideally, extending the percep- 
tron formula, the following expression would be true for every example i: 


y(x"w+b)>M 


In this expression, the vector-vector multiplication of the transpose of the feature 
vector x by a coefficient vector w is summed with a constant bias (b). It provides 
a single value whose sign is indicative of the class. Because y can be only —1 or +1, 
you can have a value equal to or more than zero when the operations between 
parentheses have guessed the same sign as y. In the preceding expression, M is a 
constraint representing the margin: It is positive and the largest possible value to 
assure the best separation of predicted classes. 
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REMEMBER 


GETTING DEEPER INTO THE MATH 


Understanding the underlying SVM math is a critical point in the formulation. It's a bit 
beyond the scope of this book to detail the mathematical passages for making the 
cost function of a perceptron into a distance. The paper “Support Vector Machines” 
(by Hearst and others), is available athttp: //www.svms.org/tutorials/ and 
provides much more detail. At the same web address, you can find other interesting 
intermediate and advanced tutorials, too. 


Linear algebra properties assure you that the preceding expression represents for 
each example i the distance from the separating hyperplane and that such dis- 
tance is equal or more than M, the margin. In the end, SVM is based on an optimal 
distance calculation between examples and the hyperplane. 


When you understand the preceding formula as a distance formulation, M natu- 
rally becomes the margin. The target of the optimization then becomes one of 
finding the parameters (the weights w) that correctly predict every example using 
the largest value of M possible. 


Solving the set of formulations required by an SVM, given such constraints (that 
is, having the largest M possible and correctly predicting the training examples), 
requires that you use a quadratic optimization. You solve the quadratic program- 
ming problem differently from the gradient descent approach discussed so far in 
the book. Apart from the math difference, the solution is also more complex, 
requiring the algorithm to solve more computations. Complexity in an SVM 
depends on the number of features and the number of examples. In particular, the 
algorithm scales badly to the number of cases, because the number of computa- 
tions is proportional to the number of examples, raised to the second or third 
power (depending on the problem type). For this reason, SVMs can scale well up 
to 10,000 examples, but beyond this limit, the computer time required for a solu- 
tion may become too demanding. 


Even though they are created for classification, SVMs can also perform regression 
problems well. You use the prediction error as distance and the separating hyper- 
plane surface values for determining the right prediction value at each feature 
value combination. 


Avoiding the pitfalls of nonseparability 


You need to consider a final issue concerning the SVM optimization process previ- 
ously described. Because you know that classes are seldom linearly separable, 
it may not always be possible for all the training examples to force the distance 
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to equal or exceed M. In this case, it’s necessary to consider the following 
expression: 


y(x'w+b)>M(1+e,) 


Each example has an epsilon value that’s always major or equal to zero, in sucha 
way that epsilon turns into a value correcting the wrong classification of specific 
vectors. If an example is correctly classified, then epsilon is 0. Values of epsilon 
between 0 and 1 indicate that the example is on the right side of the separating 
line, yet it entered the margin. Finally, when epsilon is above 1, the case is mis- 
classified and is located on the other side of the optimal separating hyperplane. 


Epsilon allows the algorithm to correct mismatched cases so that the optimization 
process will accept them. You can define the extent of this correction by specifying 
that the sum of all epsilons should be less than a value, C, which acts as a way to 
create a solution that is more affected by bias (if the value of C is small) or vari- 
ance (if the value of C is large). The value of C approximately indicates the number 
of mismatches that the SVM can ignore when learning. 


The C parameter is very important, if not the most important parameter of all in 
an SVM. Epsilon can create exceptions in classification and allow a more realistic 
fit of the SVM formulation when data is imprecise and noisy. However, if you 

reMemBeR introduce too much correction (thus, C is too large), the search for the optimal 
margin will accept too many exceptions, creating an overfitting to data. On the 
other hand, if C is too little, the SVM will look for a complete separation of the 
points, which results in an unsuccessful suboptimal hyperplane whose margin is 
too small. 


Be sure not to confuse the C value in the preceding formulation with the C param- 
eter required by SVM software implementations. Both R and Python implementa- 
tions of SVMs accept a C parameter as a cost for a mismatch. As a software 

TIP parameter, a high C value leads to a smaller margin, because costs are higher 
when an SVM classifies incorrectly. On the other hand, a low C implies a larger 
margin, open acceptance of examples wrongly predicted, and thus an increased 
variance of estimates. 


Applying Nonlinearity 


SVMs are quite mathematically demanding. Up to now, you’ve seen some formu- 
lations that help you understand that SVMs are an optimization problem that 
strives to classify all the examples of two classes. When solving the optimization 
in SVMs, you use a partitioning hyperplane having the largest distance from the 
class boundaries. If classes aren’t linearly separable, the search for the optimal 
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FIGURE 17-2: 
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line (right). 


separating hyperplane allows for errors (quantified by the value of C) in order to 
deal with noise. 


In spite of allowing a small cost for errors, the SVM’s linear hyperplane can’t 
recover nonlinear relationships between classes unless you transform the features 
appropriately. For instance, you can correctly classify only a part of the examples 
like the ones depicted in Figure 17-2 if you don’t transform the existing two 
dimensions into other dimensions using multiplication or powers. 
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xl (x1^2)+(x2^2) 


In other words, you map the existing features onto a feature space of higher 
dimensionality in hopes of finding a way to separate the classes linearly. This is 
the same process shown in Chapter 15 when trying polynomial expansion (where 
you discovered how a linear model captures nonlinear relationships between vari- 
ables automatically). Automatic feature creation using polynomial expansion, 
which is also possible for SVMs, has some important limits: 


>> The number of features increases exponentially, making computations 
cumbersome and saturating the in-memory dataset. (Some datasets cannot 
be expanded beyond a power of 2 or 3.) 


>> The expansion creates many redundant features, causing overfitting. 


>» It's difficult to determine the level of expansion where the classes will become 
linearly separable, thus requiring many iterations of expansion and test. 


As a consequence of these limitations, SVM has adopted a different way, called 
kernel functions, to redefine the feature space without occupying more memory or 
increasing the number of computations too much. Kernel functions aren’t magic; 
they rely on algebra calculations, but they are even more mathematically demand- 
ing than SVM. As a means of understanding how they work, you could say that 
kernel functions project the original features into a higher dimensional space by 
combining them in a nonlinear way. They do so in an implicit way because they 
don’t provide SVM with a new feature space to use for learning (as a polynomial 
expansion would do). Instead, they rely on a vector of values that the SVM can use 
directly to fit a nonlinear separating hyperplane. 
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REMEMBER 


TIP 


Kernel functions, therefore, provide the result of a combination of features (pre- 
cisely a dot-product, a multiplication between vectors), without calculating all the 
combinations involved in such result. This is called the kernel trick and is possible 
for SVMs because their optimization processes can be reformulated into another 
form, called the dual formulation (in contrast with the previous formula, called the 
primal formulation). The dual formulation operates directly on the results of the 
kernel functions and not on the original features. 


Kernels aren’t just for SVMs. You can apply them to any machine learning algo- 
rithm using a formulation operating on the dot-product between examples, as the 
SVM dual formulation does. 


If you would like to know more about the dual formulation and how it differs 
from the primal one, you can find more details in this Quora question at https: // 
www. quora.com/Support—Vector—Machines/Why-is-solving-—in-the-dual- 
easier—than-solving-in-the-primal, or more formally in this tutorial from 
Microsoft Research at http://research.microsoft.com/en—us/um/people/ 
manik/projects/trade-off/svm.html. 


Demonstrating the kernel trick by example 


In this section, as an example of how a kernel function operates, you see how to 
make an explicit and implicit transformation by using Python. The cost function 
is a dual one and operates on dot-products of examples. The dataset is minimal: 
two examples with three features. The goal is to map our existing features to a 
higher dimensionality, using all the possible combinations between the features. 
So if an example is made of the feature vector of values (1,2,3), projecting it this 
way results in a new vector of values (1, 2, 3, 2, 4, 6, 3, 6, 9). The projection grows 
the dimensionality from three to nine features, occupying three times more com- 
puter memory. 


import numpy as np 
xenp array 1,23], (8,2, 4] 1) 
def poly_expansion(A): 
return np.array([[x*y for x in 
row for y in row] for row in A]) 


poly_X = poly_expansion(X) 

print ('Dimensions after expanding: %s' 
% str(poly_X.shape) ) 

print (poly_X) 


Dimensions after expanding: (2, 9) 
fla 232463 6 3] 
[963642 321]] 
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REMEMBER 


The code, using the poly_expansion function, iterates the X matrix (the dataset 
of two examples and three features), expanding the number of features to nine. At 
this point, you can compute the dot-product. 


np.dot(poly_X[@],poly_X[1]) 
100 


By using a kernel function instead, you can simply call the function after provid- 
ing the two features vectors of the examples. You obtain the result without having 
to create a new dataset. 


def poly_kernel(a, b): 
return np.sum(axb)«*2 


poly_kernel(X[@], X[1]) 
100 


Kernel functions are convenient mapping functions that allow SVMs to obtain a 
transformed dataset of limited size, which is equivalent to a more complicated 
and data-intensive nonlinear transformation. Computationally accessible by most 
computers in terms of processing and memory, kernel functions allow you to 
automatically try solving a data problem using a nonlinear separating hyperplane 
without involving any particular human intervention in feature creation. 


Discovering the different kernels 


In its implementations in both R and Python, SVM offers quite a large range of 
nonlinear kernels. Here is a list of the kernels and their parameters: 


>> Linear: No extra parameters 

>» Radial Basis Function: Shape parameters: gamma 

>> Polynomial: Shape parameters: gamma, degree, and coe f@ 
>» Sigmoid: Shape parameters: gamma and coef® 


>> Custom-made kernels: Depends upon the kernel 


Even though the choice is large and possibly larger if you start designing your own 
custom kernel, in practice, using just one of them, the Radial Basis Function 
(RBF), is common because it’s faster than other kernels. In addition, it can map 
and approximate almost any nonlinear function if you tweak its shape parameter, 
gamma. 
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FIGURE 17-3: 

An RBF kernel 
that uses diverse 
hyper-parameters 
to create unique 
SVM solutions. 
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REMEMBER 


The RBF works in a simple but smart way. It creates a margin around every sup- 
port vector — drawing bubbles in the feature space, as shown in Figure 17-3. 
Then, according to the gamma hyper-parameter value, it expands or restricts the 
volume of the bubbles so that they fuse with each other and shape classification 
areas. The gamma value plays the role of the radius of all the bubbles that RBF cre- 
ated. The resulting margin and the hyperplane passing through it will conse- 
quently show very curvy boundaries, demonstrating that it can be quite flexible, 
as in the examples provided by Figure 17-3. 


























The RBF kernel can adapt itself to different learning strategies, resulting in a 
bended hyperplane when C, the error cost, is high, and creating a smoother curve 
line when C is low. It can also fit complex shapes, such as the bull’s eye, when a 
class is placed inside another. All this flexibility comes at the expense of a larger 
estimate variance, but it also detects complex classification rules that other algo- 
rithms may fail to find. 


When tuning an RBF kernel, first fix C, the error cost, to define a bended separat- 
ing hyperplane; then tune gamma to make the margin shape rough and broken 
when the hyper-parameter gamma is low or regular and fused into large bubble- 
shaped areas when it is high. 


The polynomial and sigmoid kernels aren’t as adaptable as RBF, thus showing 
more bias, yet they are both nonlinear transformations of the separating hyper- 
plane. If the sigmoid features a single bend, the polynomial function can have as 
many bends or separated hyperplanes as its set degree. The higher the degree, the 
longer it takes to compute. 


You have many shape values to fix when using the sigmoid and polynomial ker- 
nels: gamma and coef@ for both and degree for polynomial. Because determining 
the effects that different values have on these parameters is difficult, you need to 
test different value combinations, using a grid search, for instance, and evaluate 
the results on a practical basis. Figure 17-4 provides visual representations of 
what you can do with the polynomial and the sigmoid kernels. 
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FIGURE 17-4: 

A sigmoid (left) 
and a polynomial 
kernel (right) 
applied to the 
same data. 


TIP 























In spite of the possibility of creating your own kernel function, most data prob- 
lems are easily solved using the RBF. Simply look for the right combination C and 
gamma by trying different values systematically until you achieve a best result for 
your validation set or cross-validation procedure. 


Illustrating Hyper-Parameters 


Although SVMs are complex learning algorithms, they are versatile and easy to 
use when implemented by a Python class or an R function. Interestingly, both 
Scikit-learn and the R e1071 libraries (developed by the TU Wien E1071 group on 
probability theory) rely on the same external C++ library (with a C API to interface 
with other languages) developed at the National Taiwan University. You can find 
more on the shared LIBSVM for SVM classification and regression at http://www. 
csie.ntu.edu.tw/-cjlin/libsvm/. The Python implementation also uses the 
LIBLINEAR C library from the same authors at Taiwan University, which special- 
izes in classification problems using linear methods on large and sparse datasets 
(see more at http: //www.csie.ntu.edu.tw/-cjlin/liblinear/). 


Because both Python and R wrap the same C library, they offer the same function- 
ality with the same hyper-parameters. Tables 17-1 and 17-2 show a complete 
overview of both classification and regression SVM across the two languages. 





TABLE 17-1 Software Implementations for Classification 
Python/R implementation Purpose Hyper-parameters 
Python: sklearn.svm.SVC The LIBSVM implementation for C (cost in R), kernel, degree, 
aaa binary and multiclass linear and gamma, coef0 
R: svm(type= “C-classification”) kernel classification 
Python: sklearn.svm.NuSVC Same as above nu, kernel, degree, gamma. coefo 


R: svm(type="nu-classification”) 





308 PART 4 Learning from Smart and Big Data 


Python/R implementation Purpose Hyper-parameters 


Python: sklearn.svm. Unsupervised detection nu, kernel, degree, gamma, coef0 


OneClassSVM 


of outliers 


R: svm(type="one-classification”) 





Python: sklearn.svm.LinearSVC Based on LIBLINEAR, it is a binary Penalty, loss, C (cost in R) 


and multiclass linear classifier 





TABLE 17-2 Regression Implementations 
Class Purpose Hyper-parameters 
Python: sklearn.svm.SVR The LIBSVM implementation C (cost in R), kernel, degree, 
for regression gamma, epsilon, coef0 
R: svm(type="eps-regression”) 
Python: sklearn.svm.NuSVR Same as preceding item nu, C (cost in R), kernel, degree, 


gamma, coef0 


R: svm(type="nu-regression”) 


In previous sections, the chapter discusses most of the hyper-parameters, such as 
C, kernel, gamma, and degree, when describing how an SVM algorithm works in 
its basic formulation and with kernels. You may be surprised to find that two new 
SVM versions exist for both classification and regression algorithms; one of them 
is based on the Nu hyper-parameter. The only differences with the Nu version are 
the parameters it takes and the use of a slightly different algorithm. Using either 
SVM version gets the same results, so you normally choose the non-Nu version. 


Python also offers a linear version of SVM, LinearSVC, which is more like the lin- 
ear regression model than an SVC. It also permits regularization and provides a 
fast SVM classification that works well with sparse textual matrices. You see this 
implementation again in Part 5 when dealing with text classification. 


Classifying and Estimating with SVM 


As an example of how you can use an SVM to work out a complex problem, this 
section demonstrates a handwritten recognition task and solves it using a nonlin- 
ear kernel, the RBF. The SVM algorithm learns from the digits dataset available 
from the module datasets in the Scikit-learn package. The digits dataset contains 
a series of 8-x-8 grayscale pixel images of handwritten numbers ranging from 
o to 9. The problem is quite simple when compared to many problems that image 
recognition engines solve today, but it helps you grasp the potential of the 
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learning approach. Even though the example relies on Python, you could create a 
similar example using R because both languages rely on the same LIBSVM library 
working behind the scenes. The example output appears in Figure 17-5. 


import matplotlib.pyplot as plt 

import matplotlib as mpl 

from sklearn.datasets import load_digits 
import numpy as np 

import random 


digits = load_digits() 
X,y = digits.data, digits.target 


%matplotlib inline 
random_examples = [random.randint(@,len(digits. images) ) 
for i in range(1@)] 

for n,number in enumerate(random_examples) : 
plt.subplot(2, 5, n+) 
plt.imshow(digits.images[number],cmap='binary', 

interpolation='none', extent=[0,8,0,8] ) 

plt.grid() 

plt.show() 









012345678 01 2345678 012345678 012345678 012345678 


FIGURE 17-5: E 
A sample from 
the MNIST 
handwritten digit : 
dataset provided a 


by Scikit-learn. 012345678 012345678 012345678 012345678 012345678 








The example code randomly prints a few samples so that you can see an example 
of the different handwritten styles found in the dataset. Your printed examples 
may differ from those found in the book. The code renders the graphic informa- 
tion from a series of numbers, placed on a vector, each one pointing to a pixel in 
the image. The algorithm learns that if certain pixels activate together, the repre- 
sented image corresponds to a particular number. 
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REMEMBER 


This example demonstrates the idea espoused by the Analogizers tribe: Certain 
patterns correspond to answers based on an analogy. SVMs tend to hide this fact 
because they seem to provide a linear combination of weighted features, just as 
linear regression does, but recall that when applying kernel functions, optimiza- 
tion is based on the dot products of examples. Dot products are a way to estimate 
distance between points in order to divide the features space into homogeneous 
partitions having the same class inside. Distance is a way to establish similarities 
in a space made of features. Then SVMs works on similarity, associating points 
and patterns based on distance. 


To verify the results from the model correctly, you first extract 30 percent of the 
examples to use as a test set (an out-of-sample). You also want to prepare the 
features by rescaling them in the range from -1 to +1 to ease the SVM computa- 
tions. The algorithm first learns the parameters from the transformation of the 
training set alone and only after they are applied to the test set so that you avoid 
any kind of out-of-sample information leakage. 


A critical action to take before feeding the data into an SVM is scaling. Scaling 
transforms all the values to the range between -1 to 1 (or from 0 to 1, if you pre- 
fer). Scaling transformation avoids the problem of having some variables influ- 
ence the algorithm and makes the computations exact, smooth, and fast. 


The code begins by adopting an SVM with a nonlinear kernel. To check the useful- 
ness of the representation by the machine learning algorithm, the example uses 
the accuracy score (the percentage of correct guesses) as a measure of how good 
the model is). 


from sklearn.cross_validation import train_test_sp1lit 
from sklearn.cross_validation import cross_val_score 
from sklearn.preprocessing import MinMaxScaler 
# We keep 30% random examples for test 
X_train, X_test, y_train, y_test = train_test_split(X, 
y, test_size=@.3, random_state=1@1 ) 
# We scale the data in the range [-1,1] 
scaling = MinMaxScaler(feature_range=(-1, 1)).fit(X_train) 
X_train = scaling.transform(X_train) 
X_test = scaling.transform(X_test) 


from sklearn.svm import SVC 


svm = SVC() 

cv_performance = cross_val_score(svm, X_train, y_train, 
cv=1@) 

test_performance = svm.fit(X_train, y_train).score(X_test, 


y_test) 
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print ('Cross-validation accuracy score: %@.3f,' 
" test accuracy score: %0.3f' 
% (np.mean(cv_per formance) , test_per formance) ) 


Cross-validation accuracy score: 9.981, 
test accuracy score: 9.985 


After having verified the cross-validation score using the default hyper- 
parameters, the code uses a systematic search to look for better settings that could 
provide a larger number of exact answers. During the search, the code tests dif- 
ferent combinations of linear and RBF together with C and gamma parameters. 
(This example can require a long time to run.) 


from sklearn.grid_search import GridSearchCV 
import numpy as np 
learning_algo = SVC(kernel='linear', random_state=101 ) 
search_space = [{'kernel': ['linear'], 
'C': np.logspace(-3, 3, 7)}, 
{'kernel': ['rbf'], 
'C':np.logspace(-3, 3, 7), 
"gamma': np.logspace(-8, 2, 6)}] 
gridsearch = GridSearchCV(learning_algo, 
param_grid=search_space, 
refit=True, cv=10) 
gridsearch. fit(X_train, y_train) 
print ('Best parameter: %s' 

% str(gridsearch.best_params_) ) 
cv_performance = gridsearch.best_score_ 
test_performance = gridsearch.score(X_test, y_test) 
print ('Cross-validation accuracy score: %@.3f,' 

' test accuracy score: %0.3f' 

% (cv_per formance, test_per formance) ) 


Best parameter: {'kernel': 'rbf', 'C': 1.0, 
"gamma': @.10Q@00@00000000001 } 
Cross-validation accuracy score: 0.988, 
test accuracy score: 0.987 


The computations may take a few minutes, after which the computer reports the 
best kernel, C, and gamma parameters, together with an improved CV score reach- 
ing almost 99 percent accuracy. The accuracy is high, indicating that the com- 
puter can almost distinguish all the different ways to write numbers from 0 to 9. 
As a final output, the code prints the numbers that the SVM wrongly predicted in 
the test set (see Figure 17-6). As humans, we might wonder whether we would 
have performed better than the machine learning algorithm did. 
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FIGURE 17-6: 

The handwritten 
numbers that the 
SVM with RBF 
kernel couldn't 
guess correctly. 


prediction = gridsearch.predict(X_test) 
wrong_prediction = (prediction!=y_test) 
test_digits = scaling. inverse_transform(X_test) 
for n,(number,yp,yt) in enumerate(zip( 
scaling. inverse_transform(X_test) [wrong_prediction], 
prediction[wrong_prediction] , 
y_test [wrong_prediction] )): 
plt.subplot(2, 5, n+) 
plt.imshow(number .reshape((8,8)),cmap='"binary', 
interpolation='none', 
extent=[@,8,2,8] ) 
plt.title('pred: '+str(yp)+"!="+str(yt)) 
plt.grid() 
plt.show() 


pred:8!=6_ pred:1!=8 pred:9!=4_ pred:1!=6_ pred:1!=8 


yas 
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pred:9!=4_ pred:9!=5 


"ipi 


012345678 012345678 


The numbers that the algorithm got wrong are particularly difficult to guess, and 
it’s no wonder that it couldn’t get them all right. Also remember that the dataset 
provided by Scikit-learn is just a portion of the real MNIST dataset, which is avail- 
able at the website http://yann.lecun.com/exdb/mnist/. The full dataset con- 
sists of 60,000 training examples and 10,000 test examples. Using the same SVC 
algorithm and the same settings, the SVC can learn the original dataset, allowing 
the computer to read any handwritten number you present. 
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IN THIS CHAPTER 


Discovering why many guesses are 
better than one 


Making uncorrelated trees work well 
together in Random Forests 


Learning to map complex target 
functions piece by piece using boosting 


Getting better predictions by averaging 
models 


Resorting to 
Ensembles of 
Learners 


fter discovering so many complex and powerful algorithms, you might be 

surprised to discover that a summation of simpler machine learning algo- 

rithms can often outperform the most sophisticated solutions. Such is the 
power of ensembles, groups of models made to work together to produce better 
predictions. The amazing thing about ensembles is that they are made up of 
groups of singularly nonperforming algorithms. 


Ensembles don’t work much differently from the collective intelligence of crowds, 
through which a set of wrong answers, if averaged, provides the right answer. Sir 
Francis Galton, the English Victorian age statistician known for having formu- 
lated the idea of correlation, narrated the anecdote of a crowd in a county fair that 
could guess correctly the weight of an ox after all the people’s previous answers 
were averaged. You can find similar examples everywhere and easily recreate the 
experiment by asking friends to guess the number of sweets in a jar and averaging 
their answers. The more friends who participate in the game, the more precise the 
averaged answer. 
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Luck isn’t what’s behind the result — it’s simply the law of large numbers in action 
(see more at https: //en.wikipedia.org/wiki/Law_of_large_numbers). Even 
though an individual has a slim chance of getting the right answer, the guess is 
better than a random value. By cumulating guesses, the wrong answers tend to 
distribute themselves around the right one. Opposite wrong answers cancel each 
other when averaging, leaving the pivotal value around which all answers are 
distributed, which is the right answer. You can employ such an incredible fact in 
many practical ways (consensus forecasts in economics and political sciences are 
examples) and in machine learning. 


Leveraging Decision Trees 
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Ensembles are based on a recent idea (formulated around 1990), but they leverage 
older tools, such as decision trees, which have been part of machine learning since 
1950. As covered in Chapter 12, decision trees at first looked quite promising and 
appealing to practitioners because of their ease of use and understanding. After 
all, a decision tree can easily do the following: 


>> Handle mixed types of target variables and predictors, with very little or no 
feature preprocessing (missing values are handled almost automatically) 


>> Ignore redundant variables and select only the relevant features 
>> Work out-of-the box, with no complex hyper-parameters to fix and tune 


>> Visualize the prediction process as a set of recursive rules arranged in a tree 
with branches and leaves, thus offering ease of interpretation 


Given the range of positive characteristics, you may wonder why practitioners 
slowly started distrusting this algorithm after a few years. The main reason is that 
the resulting models often have high variance in the estimates. 


To grasp the critical problem of decision trees better, you can consider the prob- 
lem visually. Think of the tricky situation of the bull’s-eye problem that requires 
a machine learning algorithm to approximate nonlinear functions (as neural 
networks do) or to transform the feature space (as when using linear model with 
polynomial expansion or kernel functions in support vector machines). Figure 18-1 
shows the squared decision boundary of a single decision tree (on the left) as 
compared to an ensemble of decision trees (on the right). 


Decision trees partition the feature space into boxes and then use the boxes for 
classification or regression purposes. When the decision boundary that separates 
classes in a bull’s-eye problem is an ellipse, decision trees can approximate it by 
using a certain number of boxes. 
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FIGURE 18-1: 
Comparing a 
single decision 
tree output (left) 
to an ensemble 
of decision 
trees (right). 


TIP 




















The visual example seems to make sense and might give you confidence when you 
see the examples placed far from the decision boundary. However, in proximity of 
the boundary, things are quite different from how they appear. The decision 
boundary of the decision tree is very imprecise and its shape is extremely rough 
and squared. The issue is visible on bidimensional problems. It decisively worsens 
as feature dimensions increase and in the presence of noisy observations (obser- 
vations that are somehow randomly scattered around the feature space). You can 
improve decision trees using some interesting heuristics that stabilize results 
from trees: 


>» Keep only the correctly predicted cases to retrain the algorithm 
>? Build separate trees for misclassified examples 


>» Simplify trees by pruning the less decisive rules 


Apart from these heuristics, the best trick is to build multiple trees using different 
samples and then compare and average their results. The example in Figure 18-1, 
shown previously, indicates that the benefit is immediately visible. As you build 
more trees, the decision boundary gets smoother, slowly resembling the hypo- 
thetical target shape. 


Growing a forest of trees 


Improving a decision tree by replicating it many times and averaging results to get 
a more general solution sounded like such a good idea that it spread, and practi- 
tioners created various solutions. When the problem is a regression, the technique 
averages results from the ensemble. However, when the trees deal with a classifi- 
cation task, the technique can use the ensemble as a voting system, choosing the 
most frequent response class as an output for all its replications. 


When using an ensemble for regression, the standard deviation, calculated from 
all the ensemble’s estimates for an example, can provide you with an estimate of 
how confident you can be about the prediction. The standard deviation shows how 
good a mean is. For classification problems, the percentage of trees predicting a 
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REMEMBER 


REMEMBER 


certain class is indicative of the level of confidence in the prediction, but you can’t 
use it as a probability estimate because it’s the outcome of a voting system. 


Deciding on how to compute the solution of an ensemble happened quickly; find- 
ing the best way to replicate the trees in an ensemble required more research and 
reflection. The first solution is pasting, that is, sampling a portion of your training 
set. Initially proposed by Leo Breiman, pasting reduces the number of training 
examples, which can become a problem for learning from complex data. It shows 
its usefulness by reducing the learning sample noise (sampling fewer examples 
reduces the number of outliers and anomalous cases). After pasting, Professor 
Breiman also tested the effects of bootstrap sampling (sampling with replace- 
ment), which not only leaves out some noise (when you bootstrap, on average you 
leave out 37 percent of your initial example set) but, thanks to sampling repeti- 
tion, also creates more variation in the ensembles, improving the results. This 
technique is called bagging (also known as bootstrap aggregation). 


Bootstrapping appears in Chapter 11 as part of validation alternatives. In boot- 
strapping, you sample the examples from a set to create a new set, allowing the 
code to sample the examples multiple times. Therefore, in a bootstrapped sample, 
you can find the same example repeated from one to many times. 


Breiman noticed that results of an ensemble of trees improved when the trees dif- 
fer significantly from each other (statistically, they’re uncorrelated), which leads 
to the last technique transformation — the creation of mostly uncorrelated 
ensembles of trees. This approach performs predictions better than bagging. The 
transformation tweak samples both features and examples. Breiman, in collabo- 
ration with Adele Cutler, named the new ensemble Random Forests (RF). 


Random Forests is a trademark of Leo Breiman and Adele Cutler. For this reason, 
open source implementations often have different names, such as randomForest 
in R or RandomForestClassifier in Python’s Scikit-learn. 


RF is a classification (naturally multiclass) and regression algorithm that uses a 
large number of decision tree models built on different sets of bootstrapped 
examples and subsampled features. Its creator strove to make the algorithm easy 
to use (little preprocessing and few hyper-parameters to try) and understandable 
(the decision tree basis) that can democratize the access of machine learning to 
nonexperts. In other words, because of its simplicity and immediate usage, RF can 
allow anyone to apply machine learning successfully. The algorithm works 
through a few repeated steps: 


1. Bootstrap the training set multiple times. The algorithm obtains a new set to 
use to build a single tree in the ensemble during each bootstrap. 
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2, Randomly pick a partial feature selection in the training set to use for finding 
the best split variable every time you split the sample in a tree. 


3. Createa complete tree using the bootstrapped examples. Evaluate new 
subsampled features at each split. Don't limit the full tree expansion to allow 
the algorithm to work better. 


4, Compute the performance of each tree using examples you didn't choose in 
the bootstrap phase (out-of-bag estimates, or OOB). OOB examples provide 
performance metrics without cross-validation or using a test set (equivalent to 
out-of-sample). 


5. Produce feature importance statistics and compute how examples associate in 
the tree's terminal nodes. 


6. Compute an average or a vote on new examples when you complete all the 
trees in the ensemble. Declare for each of them the average estimate or the 
winning class as a prediction. 


All these steps reduce both the bias and the variance of the final solution because 
the solution limits the bias. The solution builds each tree to its maximum possible 
extension, thus allowing a fine fitting of even complex target functions, which 
means that each tree is different from the others. It’s not just a matter of building 
on different bootstrapped example sets: Each split taken by a tree is strongly 
randomized — the solution considers only a random feature selection. Conse- 
quently, even if an important feature dominates the others in terms of predictive 
power, the times a tree doesn’t contain the selection allows the tree to find differ- 
ent ways of developing its branches and terminal leaves. 


The main difference with bagging is this opportunity to limit the number of fea- 
tures to consider when splitting the tree branches. If the number of selected fea- 
tures is small, the complete tree will differ from others, thus adding uncorrelated 
trees to the ensemble. On the other hand, if the selection is small, the bias increases 
because the fitting power of the tree is limited. As always, determining the right 
number of features to consider for splitting requires that you use cross-validation 
or OOB estimate results. 


No problem arises in growing a high number of trees in the ensemble. You do need 
to consider the cost of the computational effort (completing a large ensemble 
takes a long time). A simple demonstration conveys how a Random Forests algo- 
rithm can solve a simple problem using a growing number of trees. Both R and 
Python offer good implementations of the algorithm. The R implementation has 
more parameters; the Python’s implementation is easier to parallelize. 
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GETTING MORE RANDOM FORESTS 
INFORMATION 


You can find more information on Random Forests in Python athttp: //scikit- 
learn.org/stable/modules/ensemble.html#forest. The R version appears at 
https: //cran.r-pro ject .org/web/packages/randomF orest/randomForest . 
pdf. For a general discussion of the algorithm, the best resource is the manual by Leo 
Breiman and Adele Cutler at www. stat . berkeley .edu/~breiman/RandomForests/ 


Because the test is computationally expensive, the example starts with the Python 
implementation. This example uses the digits dataset that you used when chal- 
lenging a support vector classifier in the previous chapter. 


import numpy as np 

from sklearn import datasets 

from sklearn.learning_curve import validation_curve 
from sklearn.ensemble import RandomForestClassi fier 


digits = datasets. load_digits() 
X,y = digits.data, digits.target 
series = [10, 25, 50, 100, 150, 200, 250, 300] 
RF = RandomForestClassi fier(random_state=1@1 ) 
train_scores, test_scores = validation_curve(RF, 
X, y, ‘n_estimators', param_range=series, 
cv=1@, scoring='accuracy',n_jobs=—1 ) 


The example begins by importing functions and classes from Scikit-learn: numpy, 
datasets module, validation_curve function, and RandomForestClassi fier. 
The last item is Scikit-learn’s implementation of Random Forests for classifica- 
tion problems. The validation_curve function is particularly useful for the tests 
because it returns the cross-validated results of multiple tests performed on 
ensembles made of differing numbers of trees (similar to learning curves). 


The example will build almost 11,000 decision trees. To make the example run 
faster, the code sets the n_ jobs parameter to —1, allowing the algorithm to use all 
available CPU resources. This setting may not work on some computer configura- 
TIP tions, which means setting n_jobs to 1. Everything will work, but it takes longer. 


After completing the computations, the code outputs a plot that reveals how the 
Random Forests algorithm converges to a good accuracy after building a few trees, 
as shown in Figure 18-2. It also shows that adding more trees isn’t detrimental to 
the results, although you may see some oscillations in accuracy due to estimate 
variances that even the ensemble can’t control fully. 
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FIGURE 18-2: 
Seeing the 
accuracy of 
ensembles of 
different sizes. 


import matplotlib.pyplot as plt 

amatplotlib inline 

plt. figure() 

plt.plot(series, np.mean(test_scores,axis=1), 
plt.xlabel('number of trees' ) 
plt.ylabel('accuracy' ) 

plt.grid() 

plt.show() 
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Understanding the importance measures 


Random Forests have these benefits: 


>> Fit complex target functions, but have little risk in overfitting 


>» Select the features they need automatically (although the random subsam- 
pling of features in branch splitting influences the process) 


>» Are easy to tune up because they have only one hyper-parameter, the 
number of subsampled features 


>» Offer OOB error estimation, saving you from setting up verification by 
cross-validation or test set 


Note that each tree in the ensemble is independent from the others (after all, they 
should be uncorrelated), which means that you can build each tree in parallel to 
the others. Given that all modern computers have multiprocessor and multithread 
functionality, they can perform computations of many trees at the same time, 


which is a real advantage of RF over other machine learning algorithms. 
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TIP 


A Random Forests ensemble can also provide additional output that could be help- 
ful when learning from data. For example, it can tell you which features are more 
important than others. You can build trees by optimizing a purity measure 
(entropy or gini index) so that each split chooses the feature that improves the 
measure the most. When the tree is complete, you check which features the algo- 
rithm uses at each split and sum the improvement when the algorithm uses a 
feature more than once. When working with an ensemble of trees, simply average 
the improvements that each feature provides in all the trees. The result shows you 
the ranking of the most important predictive features. 


Practitioners call importance evaluation gini importance or mean decrease impurity. 
You can compute it in both R and Python algorithm implementations. Another 
way to estimate feature importance is mean decrease accuracy, and you obtain it as 
an output of the randomForest function in R. In this estimation, after the algo- 
rithm builds each tree, it replaces each feature with junk data and records the 
decrease in predictive power after doing so. If the feature is important, crowding 
it with casual data harms the prediction, but if it’s irrelevant, the predictions are 
unchanged. Reporting the average performance decrease in all trees due to ran- 
domly changing the feature is a good indicator of a feature’s importance. 


You can use importance output from Random Forests to select features to use in 
the Random Forests or in another algorithm, such as a linear regression. The 
Scikit-learn algorithm version offers a tree-based feature selection, which provides 
a way to select relevant features using the results from a decision tree or an 
ensemble of decision trees. You can use this kind of feature selection by employing 
the SelectFromModel function found in the feature_selection module (see 
http://scikit-learn.org/stable/modules/generated/sklearn. feature_ 
selection.SelectFromModel .htm1). 


To provide an interpretation of importance measures derived from Random For- 
ests, this example tests the R implementation on the air quality dataset, which 
reports the ozone level in the air of New York from May to September 1973. To 
make the example work, you must install the randomForest library. The following 
R example also uses the caret library, so if you haven’t installed it yet, this is the 
right time to do so. 


install .packages("randomForest" ) 
install.packages("caret" ) 


As a first step, the example uploads the dataset and filters the examples whose 
ozone level data isn’t missing. 


library(caret) 
library(randomForest) 


PART 4 Learning from Smart and Big Data 


TIP 


# Data preparation 

data(airquality, package="datasets" ) 

dataset <- airquality[!(is.na(airquality$0zone) ), ] 
dataset[is.na(dataset)] <- -1 


After filtering, the example finds all the missing dataset values and sets them to -1. 
Because all the predictors (solar radiation, wind, temperature, month, and day) are 
positive, using a negative replacement value tells the Random Forests decision 
trees to split when the information that a value is missing is somehow predictive. 


# Optimizing a tree 
rf_grid <- expand.grid(.mtry=c(2,3,5)) 
rf_model<-train(Ozone ~ ., data=dataset, method="rf", 
trControl=trainControl (method="cv",number=1@), 
metric = "RMSE", 
ntree=500, 
importance = TRUE) 
print (rf_model ) 


The caret package provides a cross-validate check on the mtry hyper-parameter, 
which represents the number of features that each tree in the ensemble considers 
to be possible candidates at each split. Having the RMSE (root mean squared error) 
as a cost function, the caret output specifies the optimal choice is mtry=2. The 
train function of caret also provides the best model, whose importance rankings 
you can question using the importance function. 


# Evaluate the importance of predictors 
print (importance(rf_model$finalModel ) ) 


%IncMSE IncNodePurity 


Solar .R 10.624739 13946 . 509 
Wind 20 . 944030 40084 . 320 
Temp 39 .697999 49512. 349 
Month T . 660438 4TTT .895 
Day 3.911277 9845. 365 


You can read the importance output according to two measures: the cost function 
percentage increase, which is based on testing garbage data using the final model 
performance; and increased node purity, which is based on the internal improve- 
ment of the branch splits of the trees. 


Sometimes the two rankings agree; sometimes they don’t. The cost function 
percentage increase is a sensitivity analysis and is representative of the general 
feature importance. Thus you can use it in other models or for communicating 
insights. Increased node impurity is mostly focused on what the algorithm deems 
important, so it’s excellent for feature selection to improve the ensemble. 
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Working with Almost Random Guesses 
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Thanks to bootstrapping, bagging produces variance reduction by inducing some 
variations in otherwise similar predictors. Bagging is most effective when the 
models created are different from each other and, though it can work with differ- 
ent kinds of models, it mostly performs with decision trees. 


Bagging and its evolution, the Random Forests, aren’t the only ways to leverage 
an ensemble. Instead of striving for ensemble elements’ independence, a totally 
contrarian strategy is to create interrelated ensembles of simple machine learning 
algorithms in order to solve complex target functions. This approach is called 
boosting, which works by building models sequentially and training each model 
using information from the previous one. 


Contrary to bagging, which prefers working with fully grown trees, boosting uses 
biased models, which are models that can predict simple target functions well. 
Simpler models include decision trees with a single split branch (called stumps), 
linear models, perceptrons, and Naïve Bayes algorithms. These models may not 
perform well when the target function to guess is complex (they’re weak learn- 
ers), but they can be trained fast and perform at least slightly better than a random 
lucky guess (which means that they can model a part of the target function). 


Each algorithm in the ensemble guesses a part of the function well, so when 
summed together, they can guess the entire function. It’s a situation not too dif- 
ferent from the story of the blind men and the elephant (https: //en.wikipedia. 
org/wiki/Blind_men_and_an_elephant). In the story, a group of blind men 
needs to discover the shape of an elephant, but each man can feel only a part of 
the whole animal. One man touches the tusk, one the ears, one the proboscides, 
one the body, and one the tail, which are different parts of the entire elephant. 
Only when they put what each one learned separately together can they figure out 
the elephant’s shape. The information for the target function to guess is trans- 
mitted from one model to the other by modifying the original dataset so that the 
ensemble can concentrate on the parts of the dataset that have yet to be learned. 


Bagging predictors with Adaboost 


The first boosting algorithm formulated in 1995 is Adaboost (short for Adaptive 
Boosting) by Yoav Freund and Robert Schapire. Here is the Adaboost formulation: 


H(X)= sign 3 aHhy(X) | 


You may think that the Adaboost formulation is quite complicated at first sight, but 
you can make it simpler by examining it piece by piece. H(X) represents the 
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prediction function, which transforms the features, the X matrix, into predictions. 
As an ensemble of models, the prediction function is a weighted summation of 
models, in a way that resembles the linear models that you examine in Chapter 15. 


The H(X) function provides results as a vector of signs (positive or negative) that 
point out classes in a binary prediction. (Adaboost is a binary prediction algo- 
rithm.) The signs derive from the summation of M models, each one distinguish- 
able by a different m index (the generic model is H,,(X)). M is an integer number 
that you determine when training from data. (Deciding by testing on a validation 
set or using cross-validation is better.) In principle, each model fits a portion of 
the data, and having too many models means fitting the data too well, which is a 
kind of memorization that leads to overfitting, high variance of estimates, and, 
consequently, bad predictions. The number of added models is therefore a critical 
hyper-parameter. 


Note that the algorithm multiplies each model H,,(X) by an alpha value, which 
differs for each model. This is the weight of the model in the ensemble, and alpha 
is devised in a smart way because its value is related to the capacity of the model 
to produce the fewest prediction errors possible. You calculate alpha as follows: 


am = 5 *log((1—err,,) / Erfa) 


Alpha, according to this formulation, gets a larger value as the error of the model 
H,,(X), pointed out by the notation err,,, gets smaller. The algorithm multiplies 
models with fewer errors by larger alpha values and thus such models play a more 
important role in the summation at the core of the Adaboost algorithm. Models 
that produce more prediction errors are weighted less. 


The role of the coefficient alpha doesn’t end with model weighting. Errors output 
by a model in the ensemble don’t simply dictate the importance of the model in 
the ensemble itself but also modify the relevance of the training examples used for 
learning. Adaboost learns the data structure by using a simple algorithm a little at 
a time; the only way to focus the ensemble on different parts of the data is to 
assign weights. Assigning weights tells the algorithm to count an example accord- 
ing to its weight; therefore, a single example can count the same as two, three, or 
even more examples. You can also make an example disappear from the learning 
process by making it count less and less. When considering weights, it becomes 
easier to reduce the cost function of the learning function by working on the 
examples that weigh more (more weight = more cost function reduction). Using 
weights effectively guides the learning process. 


Initially, the examples, as in all the other learning algorithms seen so far, have the 
same contribution in the construction of the model. The optimization happens as 
usual. After creating the first model and estimating a total error, the algorithm 
checks each example to determine whether the prediction is correct. If correctly 
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predicted, nothing happens; each example’s weight remains the same as before. If 
misclassfied, each example has its weight increased, and in the next iteration, 
examples with larger weight influence the model, placing a greater emphasis on 
finding a solution for the larger example. 


At each iteration, the Adaboost algorithm is guided by weights to work on the part 
of data that’s less predictable. In fact, you don’t need to work on data that the 
algorithm can predict well. Weighting is a smart solution for conditioning learn- 
ing, and gradient boosting machines refine and improve the process. Notice that 
the strategy here is different from RF. In RF, the goal is to create independent 
predictions; here, the predictors are chained together because earlier predictors 
determine how later predictors work. Because boosting algorithms rely on a chain 
of calculations, you can’t easily parallelize the computations, so they’re slower. 
You can express the formulation of the weight update in this way: 


W; =W; *exp(am *I(y; #H,,(x,))) 


The I(y,  H,,(x,)) function outputs 0 if the inequality is false and 1 if true. When 
true, the previous example weight is multiplied by the exponential of alpha. The 
algorithm modifies the resulting vector w by overweighting single misclassified 
cases by the most recent learning algorithm in the ensemble. Figuratively, learn- 
ing in such a way is like taking a small improvement step each time toward the 
goal of a working predicting ensemble, and doing so without looking back, because 
after learning algorithms are summed, you can’t change them anymore. 


Remember the kinds of learning algorithms that work well with Adaboost. Usually 
they are weak learners, which mean that they don’t have much predictive power. 
Because Adaboost approximates complex functions using an ensemble of its parts, 
using machine learning algorithms that train quickly and have a certain bias 
makes sense, so the parts are simple. It’s just like drawing a circle using a series 
of lines: Even though the line is straight, all you have to do is draw a polygon with 
as many sides as possible to approximate the circle. Commonly, decision stumps 
are the favorite weak learner for an Adaboost ensemble, but you can also success- 
fully use linear models or Naive Bayes algorithms. The following example lever- 
ages the bagging function provided by Scikit-learn to determine whether decision 
trees, perceptron, or the K-Nearest Neighbour algorithm is best for handwritten 
digits recognition. 


import numpy as np 

from sklearn.ensemble import AdaBoostClassi fier 

from sklearn.tree import DecisionTreeClassi fier 

from sklearn.linear_model import Perceptron 

from sklearn.naive_bayes import BernoulliNB 

from sklearn.cross_validation import cross_val_score 
from sklearn import datasets 
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digits = datasets. load_digits() 
X,y = digits.data, digits.target 


DT = cross_val_score(AdaBoostClassi fier( 
DecisionTreeClassifier(), 
random_state=101) ,X, y, 
scoring='accuracy' ,cv=10) 

P = cross_val_score(AdaBoostClassi fier( 
Perceptron(), random_state=101, 
algorithm='SAMME') ,X, y, 
scoring='accuracy' ,cv=10) 

NB = cross_val_score(AdaBoostClassi fier( 
BernoulliNB(), random_state=101 ) 
,X,y,scoring='accuracy',cv=1@) 


print ("Decision trees: %@.3f\nPerceptron: %@.3f\n" 
"Naive Bayes: %0.3f" % 
(np.mean(DT),np.mean(P), np.mean(NB) ) ) 


You can improve the performance of Adaboost by increasing the number of ele- 
ments in the ensemble until the cross-validation won’t report worsening results. 
The parameter you can increase is n_estimators,and it’s currently set to 50. The 

TIP weaker your predictor is, the larger your ensemble should be in order to perform 
the prediction well. 


Boosting Smart Predictors 


The Adaboost, previously discussed in this chapter, explains how the learning 
procedure creates a function after moving step by step toward a target, an analogy 
that’s similar to the gradient descent described in Chapter 10. This section 
describes the gradient boosting machines (GBM) algorithm, which uses gradient 
descent optimization to determine the right weights for learning in the ensemble. 
The resulting performances are indeed impressive, making GBM one of the most 
powerful predictive tools that you can learn to use in machine learning. Here is the 
GBM formulation: 


F(x) = Sv eh (KW) 


As in Adaboost, you start from the formulation. The GBM formulation requires 
that the algorithm make a weighted sum of multiple models. In fact, what changes 
the most is not the principle of how boosting works but rather the optimization 
process for getting the weight and the power of the summed functions, which 
weak learners can’t determine. 
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In the preceding formula, M represents the number of total models and h repre- 
sents the final function, which is the sum of a series of M models. Each model is 
different, hence the notation h,,, which translates into h1, h2, and so on. The dif- 
ference between the learning functions of the series occurs because the models 
depend on both the features X and the examples weighted by the values of the 
vector w, which actually changes for every model. 


Meeting again with gradient descent 


Up to now, things aren’t all that different from Adaboost. However, note that the 
algorithm weights each model by a constant factor, v, the shrinkage factor. This is 
where you start noticing the first difference between Adaboost and GBM. The fact 
is that v is just like alpha. However, here it’s fixed and forces the algorithm to 
learn in any case, no matter the performance of the previously added learning 
function. Considering this difference, the algorithm builds the chain by reiterating 
the following sequence of operations: 


fr(x) = fna(x) +V *hiGs Wn) 


Look at the formula as it develops during training. After each iteration m, the 
algorithm sums the result of the previous models with a new model built on the 
same features, but on a differently weighted series of examples. This is repre- 
sented in the form of the function h(X,w). The function shows the other difference 
with respect to Adaboost: The vector w isn’t determined by the misclassified 
errors from the previous model but rather derives from a gradient descent opti- 
mization, which assigns weights with respect to a cost function, optionally of 
different kinds. 


GBM can take on different problems: regression, classification, and ranking (for 
ordering examples), with each problem using a particular cost function. Gradient 
descent helps discover the set of vector w values that reduce the cost function. 
This calculation is equivalent to selecting the best examples to use to obtain a bet- 
ter prediction. The algorithm calculates vector w multiple times as the function h 
uses it, and each time, the algorithm adds the resulting function to the previous 
ones. The secret of GBM’s performance lies in weights optimized by gradient 
descent, as well as in these three smart tricks: 


>> Shrinkage: Acts as a learning rate in the ensemble. As in gradient descent, 
you must fix an adequate learning rate to avoid jumping too far from the 
solution, which is the same as in GBM. Small shrinkage values lead to better 
predictions. 


>> Subsampling: Emulates the pasting approach. If each subsequent tree builds 
on a subsample of the training data, the result is a stochastic gradient 
descent. For many problems, the pasting approach helps reduce noise and 
influence by outliers, thus improving the results. 
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>> Trees of fixed size: Fixing the tree depth used in boosting is like fixing a 
complexity limit to learning functions that you put into the ensemble, yet 
relying on more sophisticated trees than the stumps used in Adaboost. Depth 
acts like power in a polynomial expansion: the deeper the tree, the larger the 
expansion, thus increasing both the ability to intercept complex target 
functions and the risk of overfitting. 


Both R and Python implement GBM using all the characteristics described in the 
chapter so far. You can learn more about the R implementation by reading about 
the gbm package at https: //cran.r—-project.org/web/packages/gbm/gbm.pdf. 
Python relies on an implementation in Scikit-learn, which is discussed at http: // 
scikit-learn.org/stable/modules/ensemble.html#gradient—boosting. The 
following example continues the previous test. In this case, you create a GBM 
classifier for the handwritten digits dataset and test its performance (this exam- 
ple may run for a long time): 


import numpy as np 

from sklearn.ensemble import GradientBoostingClassi fier 
from sklearn.cross_validation import cross_val_score 
from sklearn import datasets 

digits = datasets. load_digits() 

X,y = digits.data, digits.target 


GBM = cross_val_score( 
GradientBoostingClassi fier(n_estimators=300, 
subsample=0.8, max_depth=2, learning_rate=0.1, 
random_state=101), X, y, scoring='accuracy' ,cv=10) 


print ("GBM: %@.3f" % (np.mean(GBM) ) ) 


Averaging Different Predictors 


Up to this section, the chapter discusses ensembles made of the same kind of 
machine learning algorithms, but both averaging and voting systems can also 
work fine when you use a mix of different machine learning algorithms. This is 
the averaging approach, and it’s widely used when you can’t reduce the estimate 
variance. 


As you try to learn from data, you have to try different solutions, thus modeling 
your data using different machine learning solutions. It’s good practice to check 
whether you can put some of them successfully into ensembles using prediction 
averages or by counting the predicted classes. The principle is the same as in 
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bagging noncorrelated predictions, when models mixed together can produce less 
variance-affected predictions. To achieve effective averaging, you have to 


TIP 


Divide your data into training and test sets. 
Use the training data with different machine learning algorithms. 


Record predictions from each algorithm and evaluate the viability of the result 
using the test set. 


Correlate all the predictions available with each other. 


Pick the predictions that least correlate and average their result. Or, if you're 
classifying, pick a group of least correlated predictions and, for each example, 
pick as a new class prediction the class that the majority of them predicted. 


Test the newly averaged or voted-by-majority prediction against the test data. 
If successful, you create your final model by averaging the results of the 
models part of the successful ensemble. 


To understand which models correlate the least, take the predictions one by 
one, correlate each one against the others, and average the correlations to 
obtain an averaged correlation. Use the averaged correlation to rank the 
selected predictions that are most suitable for averaging. 
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Applying 
Learning to Real 
Problems 


IN THIS PART... 





| Working with images 
Reading and processing text 
Scoring and classifying text 


Using data to make recommendations 


IN THIS CHAPTER 


Handling images with Python 
Obtaining features from an image 


Using image information to detect 
features from faces 


Performing image classification tasks 
on images of faces 


Classifying Images 


mong the five senses, sight is certainly the most powerful in conveying 

knowledge and information derived from the world outside. Many people 

feel that the gift of sight helps children know about the different things 
and persons around them. In addition, humans receive and transmit knowledge 
across time by means of pictures, visual arts, and textual documents. This chapter 
helps you understand how Python can help your computer interact with images. 


Because sight is so important and precious, it’s similarly invaluable for a machine 
learning algorithm because sight opens the algorithm to new capabilities. Most 
information today is available in digital form (text, music, photos, and videos), 
yet, being able to read visual information in a binary format doesn’t help you to 
understand it and to use it properly. In recent years, one of the more important 
uses of vision in machine learning is to classify images for all sorts of reasons. 
This chapter helps you discover techniques for obtaining image features for use in 
machine learning tasks. 


For example, robots need to know which objects they should avoid and which 
objects they need to work with, yet without image classification, the task is impos- 
sible. Humans also rely on image classification to perform tasks such as hand- 
writing recognition and finding particular individuals in a crowd. Here’s a 
smattering of other vital tasks of image classification: conducting medical scans; 
detecting pedestrians (an important feature to implement in cars and that could 
save thousands of lives); and helping farmers determine where fields need the 
most water. Check out the state of the art in image classification at: http: // 
rodrigob.github. io/are_we_there_yet/build/. The final two sections of this 
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chapter demonstrate techniques for obtaining image information and then using 
that information to perform classification tasks. 


Working with a Set of Images 
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At first sight, image files appear as unstructured data made up of a series of bits. 
The file doesn’t separate the bits from each other in any way. You can’t simply 
look into the file and see any image structure because none exists. As with other 
file formats, image files rely on the user to know how to interpret the data. For 
example, each pixel of a picture file could consist of three 32-bit fields. Knowing 
that each field is 32 bits is up to you. A header at the beginning of the file may 
provide clues about interpreting the file, but even so, it’s up to you to know how 
to interact with the file using the right package or library. 


You use Scikit-image for examples presented in this chapter. It’s a Python package 
dedicated to processing images, picking them up from files, and handling them 
using NumPy arrays. By using Scikit-image, you can obtain all the skills needed to 
load and transform images for any machine learning algorithm. This package also 
helps you upload all the necessary images, resize or crop them, and flatten them 
into a vector of features in order to transform them for learning purposes. 


Scikit-image isn’t the only package that can help you deal with images in Python. 
There are also other packages, such as the following: 


>> scipy.ndimage (http: //docs.scipy.org/doc/scipy/reference/ 
ndimage.htm1): Allows you to operate on multidimensional images 


>> Mahotas (http: //mahotas.readthedocs.org/en/latest/): A fast C++ 
based processing library 


>> OpenCV (https: //opencv-python-tutroals.readthedocs.org/): 
A powerful package that specializes in computer vision 


>> ITK (http://www. itk.org/): Designed to work on 3D images for medical 
purposes 


The example in this section shows how to work with a picture as an unstructured 
file. The example image is a public domain offering from http://commons. 
wikimedia.org/wiki/Main_Page. To work with images, you need to access the 
Scikit-image library (http: //scikit-image.org/), which is an algorithm collec- 
tion used for image processing. You can find a tutorial for this library at http: // 
scipy—lectures.github. io/packages/scikit—image/. The first task is to dis- 
play the image onscreen using the following code. (Be patient: The image is ready 
when the busy indicator disappears from the IPython Notebook tab.) 
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from skimage.io import imread 

from skimage.transform import resize 
from matplotlib import pyplot as plt 
import matplotlib.cm as cm 


amatplotlib inline 


example_file = ("http://upload.wikimedia.org/" + 
"wikipedia/commons/7/7d/Dog_face.png" ) 

image = imread(example_file, as_grey=True) 

plt.imshow(image, cmap=cm.gray) 

plt.show() 


The code begins by importing a number of libraries. It then creates a string that 
points to the example file online and places it in example_file. This string is part 
of the imread() method call, along with as_grey, which is set to True. The as_ 
grey argument tells Python to turn any color images into grayscale. Any images 
that are already in grayscale remain that way. 


After you have an image loaded, you render it (make it ready to display onscreen. 
The imshow() function performs the rendering and uses a grayscale color map. 
The show( ) function actually displays image for you, as shown in Figure 19-1. 


FIGURE 19-1: 

The image 
appears onscreen 
after you render 
and show it. 





Sometimes images aren’t perfect; they can present noise or other granularity. You 
must smooth the erroneous and unusable signals. Filters can help you achieve that 
smoothing without hiding or modifying important characteristics of the image, 
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such as the edges. If you’re looking for an image filter, you can clean up your 
images using the following: 


>> Median filter: Based on the idea that the true signal comes from a median of 
a neighborhood of pixels. A function disk provides the area used to apply the 
median, which creates a circular window on a neighborhood. 


>> Total variation denoising: Based on the idea that noise is variance and this 
filter reduces the variance. 


>> Gaussian filter: Uses a Gaussian function to define the pixels to smooth. 


The following code provides you with an idea of the effect every filter has on the 
final image, with the effects shown in Figure 19-2: 


import warnings 

warnings. filterwarnings("ignore" ) 

from skimage import filters, restoration 

from skimage.morphology import disk 

median_filter = filters.rank.median(image, disk(1)) 
tv_filter = restoration.denoise_tv_chambolle(image, 


weight=0.1) 
gaussian_filter = filters.gaussian_filter(image, 
sigma=0.7) 
Median filter TV filter Gaussian filter 


FIGURE 19-2: 
Different 
filters for 

different noise 
cleaning, 





Don’t worry if a warning appears when you’re running the code. It happens 
because the code converts some number during the filtering process and the new 
numeric form isn’t as rich as before. 


fig = plt. figure() 
for k,(t,F) in enumerate((('Median filter',median_filter), 
(‘TV filter',tv_filter), 
('Gaussian filter', gaussian_filter))): 
f=fig.add_subplot(1,3,k+1) 
plt.axis('off') 
f.set_title(t) 
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FIGURE 19-3: 
Cropping the 
image makes it 
smaller. 


plt.imshow(F, cmap=cm.gray) 
plt.show() 


If you aren’t working in IPython (or you aren’t using the magic command 
%matplotlib inline), just close the image when you’re finished viewing it after 
filtering noise from the image. (The asterisk in the In [*]: entry tells you that the 
code is still running and you can’t move on to the next step.) The act of closing the 
image ends the code segment. You now have an image in memory, and you may 
want to find out more about it. When you run the following code, you discover the 
image type and size: 


print("data type: %s, shape: %s" % 
(type(image), image.shape) ) 


The output from this call tells you that the image type is a numpy.ndarray and 
that the image size is 90 pixels by 90 pixels. The image is actually an array of 
pixels that you can manipulate in various ways. For example, if you want to crop 
the image, you can use the following code to manipulate the image array: 


image2 = image[5:70,0:70] 
plt.imshow(image2, cmap=cm. gray ) 
plt.show() 


The numpy.ndarray in image2 is smaller than the one in image, so the output is 
smaller as well. Figure 19-3 shows typical results. The purpose of cropping the 
image is to make it a specific size. Both images must be the same size for you to 
analyze them. Cropping is one way to ensure that the images are the correct size 
for analysis. 
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Another method that you can use to change the image size is to resize it. The fol- 
lowing code resizes the image to a specific size for analysis: 


image3 = resize(image2, (30, 30), mode='nearest' ) 
plt.imshow(image3, cmap=cm. gray) 


print("data type: %s, shape: %s" % 
(type(image3), image3.shape) ) 


The output from the print() function tells you that the image is now 30 pixels by 
30 pixels in size. You can compare it to any image with the same dimensions. 


After you have cleaned up all the images and made them the right size, you need 
to flatten them. A dataset row is always a single dimension, not two or more 
dimensions. The image is currently an array of 30 pixels by 30 pixels, so you can’t 
make it part of a dataset. The following code flattens image3, so it becomes an 
array of 900 elements stored in image_row. 


image_row = image. flatten() 
print("data type: %s, shape: %s" % 
(type(image_row), image_row.shape) ) 


Notice that the type is still a numpy.ndarray. You can add this array to a dataset 
and then use the dataset for analysis purposes. The size is 900 elements, as 
anticipated. 


Extracting Visual Features 
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Machine learning on images works because it can rely on features to compare 
pictures and associate an image with another one (because of similarity) or to a 
specific label (guessing, for instance, the represented objects). Humans can easily 
choose a car or a tree when we see one in a picture. Even if it’s the first time that 
we see a certain kind of tree or car, we can correctly associate it with the right 
object (labeling) or compare it with similar objects in memory (image recall). 


In the case of a car, having wheels, doors, a steering wheel, and so on are all ele- 
ments that help you categorize a new example of a car among other cars. It hap- 
pens because you see shapes and elements beyond the image itself; thus, no 
matter how unusual a tree or a car may be, if it owns certain characteristics, you 
can figure out what it is. 


An algorithm can infer elements (shapes, colors, particulars, relevant elements, 
and so on) directly from pixels only when you prepare data for it. Apart from 
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special kinds of neural networks, called convolutional networks (discussed in 
Chapter 16 as part of deep learning), which rank as the state of the art in image 
recognition because they can extract useful features from raw images by them- 
selves, it’s always necessary to prepare the right features when working with 
images. 


Feature preparation from images is like playing with a jigsaw — you have to fig- 
ure out any relevant particular, texture, or set of corners represented inside the 
image in order to recreate a picture from its details. All this information serves as 
the image features and makes up a precious element for any machine learning 
algorithm to complete its job. 


Convolutional neural networks filter information across multiple layers, training 
the parameters of their convolutions (kinds of image filters); thus they can filter 
out only the features relevant to the images and the tasks they’re trained to per- 
form. Other special layers, called pooling layers, help the neural net catch these 
features in the case of translation (they appear in unusual parts of the image) or 
rotation. 


Applying deep learning requires special techniques and machines able to sustain the 
heavy computational workload. The Caffe library, developed by Yangqing Jia from 
the Berkeley Vision and Learning Center, allows building such neural networks but 
also leverages existing pretrained ones (http://caffe.berkeleyvision.org/ 
model_zoo.htm1). A pretrained neural network is a convolutional network trained 
on a large number of varied images, thus learning how to filter out a large variety of 
features for classification purpose. The pretrained network lets you input your 
images and obtain a large number of values that correspond to a score on a certain 
kind of feature previously learned by the network as an output. The features may 
correspond to a certain shape or texture. What matters to your machine learning 
objectives is that the most revealing features for your purpose are among those 
produced by the pretrained network, so you must choose the right features by mak- 
ing a selection using another neural network, an SVM, or a simple regression model. 


When you can’t use a convolutional neural network or pretrained library (because 
of memory or CPU constraints), OpenCV (http: //opencv-python-tutroals. 
readthedocs.org/en/latest/py_tutorials/py_feature2d/py_table_of_ 
contents_feature2d/py_table_of_contents_feature2d.html1) or some Scikit- 
image functions can still help. For instance, to emphasize the borders of an image, 
you can apply a simple process using Scikit-image, as shown here: 


from skimage import measure 

contours = measure. find_contours(image, 0.55) 
plt.imshow(image, cmap=cm.gray) 

for n, contour in enumerate(contours): 
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plt.plot(contour[:, 1], contour[:, @], linewidth=2) 
plt.axis('image' ) 
plt.show() 


You can read more about finding contours and other algorithms for feature 
extraction (histograms; corner and blob detection) in the tutorials at http: // 
scikit-image.org/docs/dev/auto_examples/. 


Recognizing Faces Using Eigenfaces 
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The capability to recognize a face in the crowd has become an essential tool for 
many professions. For example, both the military and law enforcement rely on it 
heavily. Of course, facial recognition has uses for security and other needs as well. 
This example looks at facial recognition in a more general sense. You may have 
wondered how social networks manage to tag images with the appropriate label or 
name. The following example demonstrates how to perform this task by creating 
the right features using eigenfaces. 


Eigenfaces is an approach to facial recognition based on the overall appearance of 
a face, not on its particular details. By means of technique that can intercept and 
reshape the variance present in the image, the reshaped information is treated 
like the DNA of a face, thus allowing recovery of similar faces (because they have 
similar variances) in a host of facial images. It’s a less effective technique than 
extracting features from the details of an image, yet it works, and you can imple- 
ment it quickly on your computer. This approach demonstrates how machine 
learning can operate with raw pixels, but it’s more effective when you change 
image data into another kind of data. You can learn more about eigenfaces at 
https://en.wikipedia.org/wiki/Eigenface or by trying the tutorial that 
explores variance decompositions in Scikit-learn at http: //scikit-learn.org/ 
stable/auto_examples/decomposition/plot_faces_decomposition.html. 


In this example, you use eigenfaces to associate images present in a training set 
with those in a test set, initially using some simple statistical measures. 


import numpy as np 

from sklearn.datasets import fetch_olivetti_faces 

dataset = fetch_olivetti_faces(shuffle=True, 
random_state=101 ) 

train_faces = dataset.data[:350, :] 

test_faces = dataset.data[350:,:] 

train_answers = dataset.target[ :350] 

test_answers = dataset.target[350: ] 
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The example begins by using the Olivetti faces dataset, a public domain set of 
images readily available from Scikit-learn. For this experiment, the code divides 
the set of labeled images into a training and a test set. You need to pretend that 
you know the labels of the training set but don’t know anything from the test set. 
As a result, you want to associate images from the test set to the most similar 
image from the training set. 


print (dataset .DESCR) 


The Olivetti dataset consists of 400 photos taken from 40 people (so there are 10 
photos of each person). Even though the photos represent the same person, each 
photo has been taken at different times during the day, with different light and 
facial expressions or details (for example, with glasses and without). The images 
are 64 x 64 pixels, so unfolding all the pixels into features creates a dataset made 
of 400 cases and 4,096 variables. It seems like a high number of features, and 
actually, it is. Using RandomizedPCA, you can reduce them to a smaller and more 
manageable number. 


from sklearn.decomposition import RandomizedPCA 
n_components = 25 
Rpca = RandomizedPCA(n_components=n_components, 
whiten=True, 
random_state=101 ). fit(train_faces) 
print ('Explained variance by %i components: %@.3f' % 
(n_components, 
np.sum(Rpca.explained_variance_ratio_))) 
compressed_train_faces = Rpca.transform(train_faces) 
compressed_test_faces = Rpca.transform(test_faces) 


Explained variance by 25 components: @.794 


The RandomizedPCA class is an approximate PCA version, which works better 
when the dataset is large (has many rows and variables). The decomposition 
creates 25 new variables (n_components parameter) and whitening (whiten=True), 
removing some constant noise (created by textual and photo granularity) and 
irrelevant information from images in a different way from the filters just 
discussed. The resulting decomposition uses 25 components, which is about 
80 percent of the information held in 4,096 features. 


import matplotlib.pyplot as plt 
photo = 17 # This is the photo in the test set 
print ('We are looking for face id=%i' 
% test_answers [photo] ) 
plt.subplot(1, 2, 1) 
plt.axis('off') 
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plt.title('Unknown face '+str(photo)+' in test set') 
plt.imshow(test_faces [photo] .reshape(64,64), 

cmap=plt.cm.gray, interpolation='nearest' ) 
plt.show() 


Figure 19-4 shows the chosen photo, subject number 34, from the test set. 


Unknown face 17 in test set 


F 


FIGURE 19-4: 

The example 
application would 
like to find similar 
photos. 





After the decomposition of the test set, the example takes the data relative only to 
photo 17 and subtracts it from the decomposition of the training set. Now the 
training set is made of differences with respect to the example photo. The code 
squares them (to remove negative values) and sums them by row, which results in 
a series of summed errors. The most similar photos are the ones with the least 
squared errors, that is, the ones whose differences are the least. 


#Just the vector of value components of our photo 
mask = compressed_test_faces[photo, | 
squared_errors = np.sum((compressed_train_faces - 
mask )**2,axis=1 ) 

minimum_error_face = np.argmin(squared_errors) 
most_resembling = list(np.where(squared_errors < 2@)[@]) 
print ('Best resembling face in train test: %i' % 

train_answers[minimum_error_ face] ) 
Best resembling face in train test: 34 


As it did before, the code can now display photo 17, which is the photo that best 
resembles images from the train set. Figure 19-5 shows typical output from this 
example. 


import matplotlib.pyplot as plt 

plt.subplot(2, 2, 1) 

plt.axis('off') 

plt.title('Unknown face '+str(photo)+' in test set') 
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plt.imshow(test_faces [photo] .reshape(64,64), 
cmap=plt.cm.gray, interpolation='nearest' ) 
for k,m in enumerate(most_resembling[:3]): 
plt.subplot(2, 2, 2+k) 
plt.title('Match in train set no. '+str(m)) 
plt.axis('off') 
plt.imshow(train_faces[m] .reshape(64,64), 
cmap=plt.cm.gray, interpolation='nearest' ) 
plt.show() 


Unknown face 17 intest set Match in train set no. 170 


F 





Match in train set no. 191 Match in train set no. 216 


FIGURE 19-5: 

The output shows 
the results that 
resemble the test 
image. 





Even though the most similar photo is similar (it’s just scaled slightly differently), 
the other two photos are quite different. However, even though those photos don’t 
match the test image as well, they really do show the same person as in photo 17. 


Classifying Images 


This section adds to your knowledge of facial recognition, this time applying a 
learning algorithm to a complex set of images, called the Labeled Faces in the Wild 
dataset that contains images of famous people collected over the Internet: http: // 
scikit-learn.org/stable/datasets/labeled_faces.html. You must download 
the dataset from the Internet, using the Scikit-learn package in Python. The pack- 
age mainly contains photos of well-known politicians. 


import warnings 
warnings. filterwarnings("ignore" ) 
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from sklearn.datasets import fetch_1fw_people 

lfw_people = fetch_1fw_people(min_faces_per_person=60, 
resize=0.4) 

X = lfw_people.data 

y = lfw_people.target 

target_names = [lfw_people.target_names[a] for a in y] 

n_samples, h, w = lfw_people. images .shape 

from collections import Counter 

for name, count in Counter(target_names).items(): 

print ("%2@s %i" % (name, count) ) 


Ariel Sharon 77 
Junichiro Koizumi 60 
Colin Powell 236 
Gerhard Schroeder 109 
Tony Blair 144 

Hugo Chavez 71 
George W Bush 530 
Donald Rumsfeld 121 


As an example of dataset variety, after dividing the examples into training and 
test sets, you can display a sample of pictures from both sets depicting Jun’Ichiro 
Koizumi, Prime Minister of Japan from 2001 to 2006. Figure 19-6 shows the 
output of the following code. 


from sklearn.cross_validation import 
StratifiedShuffleSplit 
train, test = list(StratifiedShuffleSplit(target_names, 
n_iter=1, test_size=0.1, random_state=101 ) ) [0] 


plt.subplot(1, 4, 1) 
plt.axis('off' ) 
for k,m in enumerate(X[train] [y[train]==6][:4]): 

plt.subplot(1, 4, 1+k) 

if k==0: 

plt.title('Train set') 
plt.axis('off') 
plt.imshow(m.reshape(5@,37), 
cmap=plt.cm.gray, interpolation='nearest' ) 

plt.show() 


for k,m in enumerate(X[test] [y[test]==6][:4]): 
plt.subplot(1, 4, 1+k) 
if k==0: 
plt.title('Test set') 
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FIGURE 19-6: 
Examples from 
the training and 


test sets do differ 


in pose and 
expression. 


plt.axis('off') 
plt.imshow(m.reshape(5@, 37), 
cmap=plt.cm.gray, interpolation='nearest' ) 
plt.show() 


Train set 


JDDG 
Ge 


As you can see, the photos have quite a few variations, even among photos of the 
same person, which makes the task challenging: expression, pose, different light, 
and quality of the photo. For this reason, the example that follows applies the 
eigenfaces method described in the previous section, using different kinds of 
decompositions and reducing the initial large vector of pixel features (1850) to a 
simpler set of 150 features. The example uses PCA, the variance decomposition 
technique; Non-Negative Matrix Factorization (NMF), a technique for decompos- 
ing images into only positive features; and FastIca, an algorithm for Independent 
Component Analysis, an analysis that extracts signals from noise and other sepa- 
rated signals (the algorithm is successful at handling problems like the cocktail 
party problem described at https: //en.wikipedia.org/wiki/Cocktail_party_ 
effect). 





Test set 


Thy: 


-Á 





from sklearn import decomposition 
n_components = 50 
pca = decomposition.RandomizedPCA( 
n_components=n_components, 
whiten=True).fit(X[train,:]) 
nmf = decomposition.NMF(n_components=n_components, 
init='nndsvda', 
tol=5e-3).fit(X[train, :]) 
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fastica = decomposition.FastICA(n_components=n_components, 

whiten=True). fit(X[train, :]) 

eigenfaces = pca.components_.reshape((n_components, h, w)) 

X_dec = np.column_stack((pca.transform(X[train,:]), 
nmf.transform(X[train,:]), 
fastica.transform(X[train,:]))) 

Xt_dec = np.column_stack((pca.transform(X[test,:]), 
nmf.transform(X[test,:]), 
fastica.transform(X[test,:]))) 

y_dec = y[train] 

yt_dec = y[test] 


After extracting and concatenating the image decompositions into a new training 
and test set of data examples, the code applies a grid search for the best combina- 
tions of parameters for a classification support vector machine (the topic of 
Chapter 17) to perform a correct problem classification. 


from sklearn.grid_search import GridSearchCV 

from sklearn.svm import SVC 

param_grid = {'C': [@.1, 1.0, 10.0, 100.0, 1000.0], 
‘gamma': [@.0@01, 2.001, 0.01, 0.1], } 

clf = GridSearchCV(SVC(kernel='rbf'), param_grid) 

clf = clf.fit(X_dec, y_dec) 

print ("Best parameters: %s" % clf.best_params_) 


Best parameters: {'gamma': @.@1, 'C': 100.0} 


After finding the best parameters, the code checks for accuracy — the percentage 
of correct answers in the test set — and obtains an estimate of about 0.82 (the 
measure may change when you run the code on your computer). 


from sklearn.metrics import accuracy_score 
solution = clf.predict(Xt_dec) 
print("Achieved accuracy: %@.3f" 

% accuracy_score(yt_dec, solution) ) 


Achieved accuracy: 9.815 


More interestingly, you can ask for a confusion matrix that shows the correct classes 
along the rows and the predictions in the columns. When a character in a row has 
counts in columns different from its row number, the code has mistakenly attrib- 
uted one of the photos to someone else. In the case of the former Prime Minister of 
Japan, the example actually gets a perfect score (notice that the output shows a 6 in 
row 6, column 6, and zeroes in the remainder of the entries for that row). 
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from sklearn.metrics import confusion_matrix 
confusion = str(confusion_matrix(yt_dec, solution) ) 
print (' '*26+ ' '.join(map(str,range(8) ))) 


print (' 


"*26+ '-'*22) 


for n, (label, row) in enumerate( 


Nook wWne © 


print ('%s %18s > %s' 


Ariel Sharon 
Colin Powell 
Donald Rumsfeld 
George W Bush 
Gerhard Schroeder 
Hugo Chavez 
Junichiro Koizumi 
Tony Blair 


vvvvvVvvV Vv 


SS ee 


zip(1fw_people.target_names, 
confusion.split('\n'))) 


% (n, label, row)) 





One eacay ae oGn 7 
62101 8 @ Qj 
222 © 2 © © @ Qj 
© ð 8 2 14 ð @ 1] 
1 4 246 1 @ @ 2] 
202161 @ 14] 
202001 5 @ 1] 
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IN THIS CHAPTER 


Considering Natural Language 
Processing (NLP) 


Defining how machines can 
understand text 


Performing scoring and classification 
tasks 


Scoring Opinions and 
Sentiments 


any people have the idea that somehow computers can understand text. 

The fact is that computers don’t even have a way in which to represent 

text — it’s all numbers to the computer. This chapter helps you under- 
stand three phases of working with text to score opinions and sentiments: using 
Natural Language Processing (NLP) to parse the text; performing the actual task 
of understanding the text; and then performing scoring and classification tasks to 
interact with the text meaningfully. 


Introducing Natural Language Processing 


As human beings, understanding language is one our first achievements, and 
associating words to their meaning seems natural. It’s also automatic to handle 
discourses that are ambiguous, unclear, or simply have a strong reference to the 
context of where we live or work (such as dialect, jargon, or terms family or asso- 
ciates understand). In addition, humans can catch subtle references to feelings 
and sentiments in text, enabling people to understand polite speech that hides 
negative feelings and irony. Computers don’t have this ability but can rely on NLP, 
a field of computer science concerned with language understanding and language 
generation between a machine and a human being. Since Alan Turing first devised 
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the Turing Test in 1950, which aims at spotting an artificial intelligence based on 
how it communicates with humans (https: //en.wikipedia.org/wiki/Turing_ 
test), NLP experts have developed a series of techniques that define the state of 
the art in computer-human interaction by text. 


A computer powered with NLP can successfully spot spam in your email, tag the 
part of a conversation that contains a verb or a noun, and spot an entity like the 
name of a person or a company (called named entity recognition; see https: // 
en.wikipedia.org/wiki/Named-entity_recognition). All these achievements 
have found application in tasks such as spam filtering, predicting the stock market 
using news articles, and de-duplicating redundant information in data storage. 


Things get more difficult for NLP when translating a text from another language 
and understanding who the subject is in an ambiguous phrase. For example, con- 
sider the sentence, “John told Luca he shouldn’t do that again.” In this case, you 
can’t really tell whether “he” refers to John or Luca. Disambiguating words with 
many meanings, such as considering whether the word mouse in a phrase refers to 
an animal or a computer device, can prove difficult. Obviously, the difficulty in all 
these problems arises because of the context. 


As humans, we can easily resolve ambiguity by examining the text for hints about 
elements like place and time that express the details of the conversation (such as 
understanding what happened between John and Luca, or whether the conversation 
is about a computer when mentioning the mouse). Relying on additional informa- 
tion for understanding is part of the human experience. This sort of analysis is 
somewhat difficult for computers. Moreover, if the task requires critical contextual 
knowledge or demands that the listener resort to common sense and general exper- 
tise, the task becomes daunting. Simply put, NLP still has a lot of ground to cover 
in order to discover how to extract meaningful summaries from text effectively or 
how to complete missing information from text. 


Understanding How Machines Read 
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Before a computer can do anything with text, it must be able to read the text in 
some manner. Chapter 13 shows how you can prepare data to deal with categorical 
variables, such as a feature representing a color (for instance, representing 
whether an example relates to the colors red, green, or blue). Categorical data is a 
type of short text that you represent using binary variables, that is, variables 
coded using one or zero values according to whether a certain value is present in 
the categorical variable. Not surprisingly, you can represent complex text using 
the same logic. 
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Therefore, just as you transform a categorical color variable, having values such 
as red, green, and blue, into three binary variables, each one representing one of 
the three colors, so you can transform a phrase like “The quick brown fox jumps 
over the lazy dog” using nine binary variables, one for each word that appears in 
the text (“The” is considered distinct from “the” because of its initial capital let- 
ter). This is the Bag of Words (BoW) form of representation. In its simplest form, 
BoW shows whether a certain word is present in the text by flagging a specific 
feature in the dataset. Take a look at an example using Python and its Scikit-learn 
package. 


The input data is three phrases, text_1, text_2, and text_3, placed in a list, cor- 
pus. A corpus is a set of homogenous documents put together for NLP analysis: 


text_1 = 'The quick brown fox jumps over the lazy dog.' 
text_2 = 'My dog is quick and can jump over fences. ' 
text_3 = 'Your dog is so lazy that it sleeps all the day.' 
corpus = [text_1, text_2, text_3] 


When you need to analyze text using a computer, you load the documents from 
disk or scrape them from the web and place each of them into a string variable. If 
you have multiple documents, you store them all in a list, the corpus. When you 
have a single document, you can split it using chapters, paragraphs, or simply the 
end of each line. After splitting the document, place all its parts into a list and 
apply analysis as if the list were a corpus of documents. 


Now that you have a corpus, you use a class from the feature_extraction module 
in Scikit-learn, CountVectorizer, which easily transforms texts into BoW like this: 


from sklearn.feature_extraction import text 

vectorizer = text.CountVectorizer(binary=True) .fit(corpus) 
vectorized_text = vectorizer.transform(corpus) 
print(vectorized_text .todense() ) 


[[020100101000110110001 92] 
[(010101101010011100002] 
[1120001100110010001111 14]] 


The CountVectorizer class learns the corpus content using the fit method and 
then turns it (using the transform method) into a list of lists. As discussed in 
Chapter 9, a list of lists is nothing more than a matrix in disguise, so what the 
class returns is actually a matrix made of three rows (the three documents, in the 
same order as the corpus) and 21 columns representing the content. 


The BoW representation turns words into the column features of a document 
matrix, and these features have a nonzero value when present in the processed text. 
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For instance, consider the word dog. The following code shows its representation in 
the BoW: 


print(vectorizer.vocabulary_) 


day s @, UMS e ad, “chews 18, "the's 1@, “iss &, 
fences (0, lazy 12 andi: al, “euiie@lx”s 15, myi 13; 
"Gein 2 S “mere S "Sove ay, e @, I s 2, 
"dog': 5, '‘jump': 10, ‘over': 14, ‘sleeps': 16, 
"weir": Qe), rors T) 


Asking the CountVectorizer to print the vocabulary learned from text reports 
that it associates dog with the number five, which means that dog is the fifth ele- 
ment in the BoW representations. In fact, in the obtained BoW, the fifth element 
of each document list always has a value of 1 because dog is the only word present 
in all the tree documents. 


Storing documents in a document matrix form can be memory intensive because 
you must represent each document as a vector of the same length as the dictionary 
that created it. The dictionary in this example is quite limited, but when you use a 
larger corpus, you discover that a dictionary of the English language contains well 
over a million terms. The solution is to use sparse matrices. A sparse matrix is a 
way to store a matrix in your computer’s memory without having zero values 
occupying memory space. You can read more about sparse matrices here: https: // 
en.wikipedia.org/wiki/Sparse_matrix. 


Processing and enhancing text 


Marking whether a word is present or not in a text is indeed a good start, but 
sometimes it is not enough. The BoW model has its own limits. As if you were put- 
ting stuff randomly into a bag, in a BoW, words lose their order relationship with 
each other. For instance, in the phrase My dog is quick and can jump over fences, you 
know that quick refers to dog because it is glued to it by the form is of the verb to 
be. In a BoW, however, everything is mixed and some internal references are lost. 
Further processing can help prevent such. The following sections discuss how to 
process and enhance text. 


Considering basic processing tasks 


Instead of marking the presence or absence of an element of the phrase (techni- 
cally called a token), you can instead count how many times it occurs, as shown in 
the following code: 


text_4 = 'A black dog just passed by but my dog is brown.' 
corpus. append(text_4) 
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vectorizer = text.CountVectorizer().fit(corpus) 
vectorized_text = vectorizer.transform(corpus) 
print(vectorized_text.todense()[-1] ) 


[[201111002001000101010000202]] 


This code modifies the previous example by adding a new phrase with the word 
dog repeated two times. The code appends the new phrase to the corpus and 
retrains the vectorizer, but it omits the binary=True setting this time. The 
resulting vector for the last inserted document clearly shows a 2 value in the ninth 
position, thus the vectorizer counts the word dog twice. 


Counting tokens helps make important words stand out. Yet, it’s easy to repeat 
phrase elements, such as articles, that aren’t important to the meaning of the 
expression. In the next section, you discover how to exclude less important ele- 
ments, but for the time being, the example underweights them using the term 
frequency-inverse document frequency (TF-IDF) transformation. 


The TF-IDF transformation is a technique that, after counting how many times 
a token appears in a phrase, divides the value by the number of documents in 
which the token appears. Using this technique, the vectorizer deems a word less 
important, even if it appears many times in a text, when it also finds that word in 
other texts. In the example corpus, the word dog appears in every text. In a clas- 
sification problem, you can’t use the word to distinguish between texts because it 
appears everywhere in the corpus. The word fox appears only in one phrase, mak- 
ing it an important classification term. 


You commonly apply a number of transformations when applying TF-IDF, with 
the most important transformation normalizing the text length. Clearly, a longer 
text has more chances to have more words that are distinctive when compared to 
a shorter text. For example, when the word fox appears in a short text, it can be 
relevant to the meaning of that expression because fox stands out among few 
other words. However, when the word fox appears once in a long text, its presence 
might not matter much because it’s a single word among many others. For this 
reason, the transformation divides the total tokens by the count of each token for 
each phrase. Treating a phrase like this turns token counting into a token 
percentage, so TF-IDF no longer considers how many times the word fox appears, 
but instead considers the percentage of times the word fox appears among all the 
tokens. The following example demonstrates how to complete the previous 
example using a combination of normalization and TF-IDF. 


TfidF = text.TfidfTransformer(norm='11' ) 
tfidf = TfidF.fit_transform(vectorized_text ) 


phrase = 3 # choose a number from @ to 3 
total = @ 
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for word in vectorizer.vocabulary_: 
pos = vectorizer.vocabulary_[word] 
value = list(tfidf.toarray() [phrase] ) [pos] 
if value !=0: 
print ("%1@s: %@.3f" % (word, value) ) 
total += value 
print ('\nSummed values of a phrase: %0.1f' % total) 


is: 0.077 

by: @.121 
brown: @.095 
dog: 0.126 
just: @.121 
my: 0.095 
black: @.121 
passed: @.121 
but: @.124 


Summed values of a phrase: 1.0 


Using this new TF-IDF model rescales the values of important words and makes 
them comparable between each text in the corpus. To recover part of the ordering 
of the text before the BoW transformation, adding n-grams (https://en. 
wikipedia.org/wiki/N-gram) is also useful. An n-gram is a continuous sequence 
of tokens in the text that you use as a single token in the BoW representation. For 
instance, in the phrase The quick brown fox jumps over the lazy dog, a bigram — that 
is, a sequence of two tokens — transforms brown fox and lazy dog into single 
tokens. A trigram may create a single token from quick brown fox. An n-gram is a 
powerful tool, but it has a drawback because it doesn’t know which combinations 
are important to the meaning of a phrase. N-grams create all the contiguous 
sequences of size N. The TF-IDF model can underweight the less useful n-grams, 
but only projects like Google’s NGram viewer (you learn more about this viewer 
later in the chapter) can tell you which n-grams are useful in NLP with any 
certainty. The following example uses CountVectorizer to model n-grams in the 
range of (2, 2), that is, bigrams. 


bigrams = text.CountVectorizer(ngram_range=(2,2)) 
print (bigrams.fit(corpus).vocabulary_) 


{'can jump': 6, 'by but': 5, 'over the': 21, 
‘it sleeps': 13, ‘your dog': 31, ‘the quick': 30, 
‘aie Cain’ 2 a, "So lemyie AS, “iS Gove IA, “Glog iSs"s Tf, 
‘quick brown': 24, ‘lazy dog': 17, ‘fox jumps': 9, 
‘is brown': 10, 'my dog': 19, ‘passed by': 22, 
‘lazy that': 18, 'black dog': 2, ‘brown fox': 3, 
"that it': 27, ‘quick and’: 23, ‘the day’: 28, 
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ON THE WEB 


"just passed': 16, 'dog just': 8, ‘jump over': 14, 
"sleeps all': 25, ‘over fences': 20, ‘jumps over': 15, 
"the lazy': 29, 'but my': 4, 'all the': @, 

‘is quick': 11} 


Setting different ranges lets you use both unigrams (single tokens) and n-grams 
in your NLP analysis. For instance the setting ngram_range=(1,3) creates all 
tokens, all bigrams, and all trigrams. You usually never need more than trigrams 
in an NLP analysis. Increasing the number of n-grams is slightly beneficial after 
trigrams and sometimes even just after bigrams, depending on the corpus size 
and the NLP problem. 


Stemming and removing stop words 


Stemming is the process of reducing words to their stem (or root) word. This task 
isn’t the same as understanding that some words come from Latin or other roots, 
but instead makes similar words equal to each other for the purpose of compari- 
son or sharing. For example, the words cats, catty, and catlike all have the stem cat. 
The act of stemming helps you analyze sentences when tokenizing them because 
words having the same stem should have the same meaning (represented by a 
single feature). 


Creating stem words by removing suffixes to make tokenizing sentences easier 
isn’t the only way to make the document matrix simpler. Languages include many 
glue words that don’t mean much to a computer but have significant meaning to 
humans, such as a, as, the, that, and so on in English. They make the text flow and 
concatenate in a meaningful way. Yet, the BoW approach doesn’t care much about 
how you arrange words in a text. Thus, removing such words is legitimate. These 
short, less useful words are called stop words. 


The act of stemming and removing stop words simplifies the text and reduces the 
number of textual elements so that only the essential elements remain. In addi- 
tion, you keep just the terms that are nearest to the true sense of the phrase. By 
reducing the number of tokens, a computational algorithm can work faster and 
process the text more effectively when the corpus is large. 


This example requires the use of the Natural Language Toolkit (NLTK), which 
Anaconda doesn’t install by default. To use this example, you must download and 
install NLTK using the instructions found at http: //www.nltk.org/install. 
html for your platform. Make certain that you install the NLTK for whatever ver- 
sion of Python you’re using for this book when you have multiple versions of 
Python installed on your system. After you install NLTK, you must also install the 
packages associated with it. The instructions at http: //www.nltk.org/data. 
htm] tell you how to perform this task. (Install all the packages to ensure that you 
have everything.) 
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The following example demonstrates how to perform stemming and remove stop 
words from a sentence. It begins by training an algorithm to perform the required 
analysis using a test sentence. Afterward, the example checks a second sentence 
for words that appear in the first. 


from sklearn.feature_extraction import text 


import nltk 

from nltk import word_tokenize 

from nltk.stem.porter import PorterStemmer 
nltk.download('punkt' ) 


stemmer = PorterStemmer ( ) 


def stem_tokens(tokens, stemmer): 
stemmed = [] 
for item in tokens: 
stemmed. append(stemmer .stem( item) ) 
return stemmed 


def tokenize(text): 
tokens = word_tokenize(text) 
stems = stem_tokens(tokens, stemmer) 
return stems 


vocab = ['Sam loves swimming so he swims all the time’ | 

vect = text.CountVectorizer(tokenizer=tokenize, 
stop_words='english' ) 

vec = vect. fit(vocab) 


sentence1 = vec.transform(['George loves swimming too!']) 


print (vec.get_feature_names()) 
print (sentence1 .toarray()) 


At the outset, the example creates a vocabulary using a test sentence and places it 
in the variable vocab. It then creates a CountVectorizer, vect, to hold a list of 
stemmed words, but it excludes the stop words. The tokenizer parameter defines 
the function used to stem the words. The stop_words parameter refers to a pickle 
file that contains stop words for a specific language, which is English in this case. 
There are also files for other languages, such as French and German. (You can see 
other parameters for the CountVectorizer() at http: //scikit-learn.org/stable/ 
modules/generated/sklearn. feature_extraction.text.CountVectorizer .htm1.) 
The vocabulary is fitted into another CountVectorizer, vec, which is used to 
perform the actual transformation on a test sentence using the transform() 
function. Here’s the output from this example. 
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[nltk_data] Downloading package punkt to 

[nltk_data] C: \Users\Luca\AppData\Roaming\nltk_data... 
[nltk_data] Unzipping tokenizers\punkt.zip. 

['love', 'sam', 'swim', 'time'] 

[[41 © 1 2]] 


The first output shows the stemmed words. Notice that the list contains only 
swim, not swimming or swims. All the stop words are missing as well. For example, 
you don’t see the words so, he, all, or the. 


The second output shows how many times each stemmed word appears in the test 
sentence. In this case, a love variant appears once and a swim variant appears once 
as well. The words sam and time don’t appear in the second sentence, so those 
values are set to 0. 


Scraping textual datasets from the web 


Given NLP’s capabilities, building complete language models is just a matter of 
gathering large text collections. Digging through large amounts of text enables 
machine learning algorithms using NLP to discover connections between words 
and derive useful concepts relative to specific contexts. For instance, when dis- 
cussing a mouse in the form of a device or an animal, a machine learning algo- 
rithm powered by NLP text processing can derive the precise topic from other 
hints in the phrase. Humans decipher these hints by having lived, seen, or read 
about the topic of the conversation. 


Computers also have the opportunity to see and read a lot. The web offers access 
to millions of documents, most of them freely accessible without restrictions. Web 
scraping allows machine learning algorithms to automatically feed their NLP pro- 
cesses and learn new capabilities in recognizing and classifying text. Developers 
have already done much to create NLP systems capable of understanding textual 
information better by leveraging the richness of the web. 


For instance, by using free text acquired from the web and other open text sources, 
such as dictionaries, scientists at Microsoft Research have developed various ver- 
sions of MindNet, a semantic network, which is a network of words connected by 
meaning. MindNet can find related words through synonyms, parts, causes, loca- 
tions, and sources. For instance, when you ask for the word car, MindNet provides 
answers such as vehicle (a synonym) and then connects vehicle to wheel because it 
is a specific part of a car, thus providing knowledge directly derived from text 
without anyone’s having specifically instructed MindNet about cars or how they’re 
made. You can read more about MindNet at https ://research.microsoft .com/ 
en-us/projects/mindnet/default.aspx. 
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Google developed something similar based on its Google Books project, helping 
build better language models for all Google’s applications. A public API based on 
Google’s work is the Ngram Viewer, which can explore how frequently certain 
combinations of tokens up to five-grams have appeared over time: https: // 
books. google.com/ngrams. 


Being able to retrieve information from the web allows even greater achievements. 
For example, you could build a dictionary of positive or negative words based on 
associated emoticons or emoji (https: //en.wikipedia.org/wiki/Emoji). 


Web scraping is a complex subject that could require an entire book to explain. 
This chapter provides you with an example of web scraping and an overview of 
what to expect. You need to install the Beautiful Soup package when using Python 
to perform web scraping (http: //www. crummy .com/software/Beauti fulSoup/). 
This package should already be part of your Anaconda installation, but if not, you 
can easily install it on your system by opening a command shell and issuing the 
command (see the blog posts at http: //blog. johnmuel lerbooks .com/2016/03/ 
23/installing-python-packages-part-1/ and http: //blog. johnmuellerbooks. 
com/2016/03/30/installing—-python-packages-part-2/ for more details on Python 
installation techniques): 


pip install beautifulsoup4 


Beautiful Soup is a package created by Leonard Richardson and is an excellent tool 
for scraping data from HTML or XML files retrieved from the web, even if they are 
malformed or written in a nonstandard way. The package name refers to the fact 
that HTML documents are made of tags, and when they are a mess, many devel- 
opers idiomatically call the document a tag soup. Thanks to Beautiful Soup, you 
can easily navigate in a page to locate the objects that matter and extract them as 
text, tables, or links. 


This example demonstrates how to download a table from a Wikipedia page con- 
taining all the major U.S. cities. Wikipedia (https: //www.wikipedia.org/) is a 
free-access and free-content Internet encyclopedia, enjoyed by millions of users 
every day, all around the world. Because its knowledge is free, open, and, most 
important, well structured, it’s a precious resource for learning from the web. 


Most publishers and many college instructors view Wikipedia as being a dubious 
source of information. Anyone can edit the entries it contains, and sometimes peo- 
ple do so in ways that slant the information politically or socially, or simply reflect 
a lack of knowledge (see http://www. foxbusiness.com/features/2015/09/02/ 
just-how-accurate-is-wikipedia.html andhttp://isites.harvard.edu/icb/ 
icb.do?keyword=k70847&pageid=icb.page346376). This means that the infor- 
mation you receive may not reflect reality. However, many studies show that the 
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community effort behind creating Wikipedia (see http: //www. livescience.com/ 
3295@-how-accurate-is-wikipedia.html, http: //www.cnet.com/news/study- 
wikipedia—as—accurate-as-britannica/, and http://www.zmescience.com/ 
science/study-wikipedia—25092014/) does tend to mitigate this issue partially. 
Even so, you need to exercise some level of care in taking Wikipedia entries at face 
value, just as you would any Internet content. Just because someone tells you 
something is so doesn’t make it true (no matter what form that information 
source might take). You need to cross-reference the information and verify the 
facts before accepting any Internet information source as factual, even Wikipedia. 
This said, the authors have verified every Wikipedia source used in the book as 
much as possible to ensure that you receive accurate information. 


Wikipedia has its own rules and terms of service, which you may read at https: // 
meta.wikimedia.org/wiki/Bot_policy#Unacceptable_usage. The terms of 
service forbid the use of bots for automated tasks, such as modifying the website 
(corrections and automatic posting), and bulk downloads (downloading massive 
amounts of data). Yet, Wikipedia is a great source for NLP analysis because you 
can download all its English articles at https: //dumps.wikimedia.org/enwiki/. 
Other languages also are available for download. Just consult https: //dumps. 
wikimedia.org/ for further information. 














from bs4 import BeautifulSoup 
import pandas as pd 
try: 
import urllib2 # Python 2.7.x 
except: 
import urllib.request as urllib2 Python 3.x 














wiki = "https://en.wikipedia.org/wiki/\ 
List_of_United_States_cities_by_population" 
header = {'User—Agent': 'Mozilla/5.@'} 

query = urllib2.Request(wiki, headers=header ) 
page = urllib2.urlopen(query ) 

soup = BeautifulSoup(page, "lxml1") 


After you upload the Beautiful Soup package, the code defines a header (stating 
that you are a human user using a browser) and a target page. The target page is 
a document containing a list of major U.S. cities: https: //en.wikipedia.org/ 
wiki/List_of_United_States_cities_by_population. The list also contains 
information about population and surface of the city. 





table = soup.find("table", 

{ "class" : "wikitable sortable" }) 
final_table = list() 
for row in table. findAll('tr'): 
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cells = row. findAl11("td") 
if len(cells) >=6: 
vi = cells[4 
v2 = cells 


find(text=True) 
[2] . find(text=True) 
v3 = cells[3] . find(text=True) 
v4 = cells[4].find(text=True) 
v5 = cells[6] .findAl1(text=True) 
v5 = v5[2].split()[Q] 
final_table.append([vi, v2, v3, v4, v5]) 
cols = ['City','State', 'Population_2014', 'Census_2010' 
, 'Land_Area_km2' ] 
df = pd.DataFrame(final_table, columns=cols) 


]. 
]. 
]. 
]. 


After downloading the page into the variable named soup, using the find() and 
findAl1() methods, you can look for a table (the <tr> and <td> tags). The cells 
variable contains a number of cell entries, each of which can contain text. The 
code looks inside each cell for textual information (v1 through v5) that it stores in 
a list (final_table). It then turns the list into a pandas DataFrame for further 
processing later. For example, you can use the DataFrame, df, to turn strings into 
numbers. Simply printing df outputs the resulting table. 


Handling problems with raw text 


Even though raw text wouldn’t seem to present a problem in parsing because it 
doesn’t contain any special formatting, you do have to consider how the text is 
stored and whether it contains special words within it. The multiple forms of 
encoding present on web pages can present interpretation problems that you need 
to consider as you work through the text. 


For example, the way the text is encoded can differ because of different operating 
systems, languages, and geographical areas. Be prepared to find a host of different 
encodings as you recover data from the web. Human language is complex, and the 
original ASCII coding, comprising just unaccented English letters, can’t represent 
all the different alphabets. That’s why so many encodings appeared with special 
characters. For example, a character can use either seven or eight bits for encoding 
purposes. The use of special characters can differ as well. In short, the interpreta- 
tion of bits used to create characters differs from encoding to encoding. You can 
see a host of encodings at http://www. i18nguy.com/unicode/codepages .html. 


Sometimes you need to work with encodings other than the default encoding set 
within the Python environment. When working with Python 3.x, you must rely on 
Universal Transformation Format 8-bit (UTF-8) as the encoding used to read and 
write files. This environment is always set for UTF-8, and trying to change it causes 
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an error message. However, when working with Python 2.x, you can choose other 
encodings. In this case, the default encoding is the American Standard Code for 
Information Interchange (ASCII), but you can change it to some other encoding. 


You can use this technique in any Python script. It can save your day when your 
code won’t work because of errors when Python can’t encode a character. How- 
ever, working at the IPython prompt is actually easier in this case. The following 
steps help you see how to deal with Unicode characters, but only when working 
with Python 2.x. (These steps are unnecessary and cause errors in the Python 3.x 
environment.) 


1. Open a copy of the IPython command prompt. 
You see the IPython window. 


2. Type the following code, pressing Enter after each line. 


import sys 
sys.getdefaultencoding( ) 


You see the default encoding for Python, which is ascii in Python 2.x (in Python 
3.x, it's utf-8 instead). If you really do want to work with Jupyter Notebook, 
create a new cell after this step. 


3. Type reload(sys) and press Enter. 
Python reloads the sys module and makes a special function available. 
4. Type sys.setdefaultencoding(utf-8’) and press Enter. 


Python does change the encoding, but you won't know that for certain until 
after the next step. If you really do want to work with Jupyter Notebook, create 
anew cell after this step. 


5. Type sys.getdefaultencoding() and press Enter. 


You see that the default encoding has now changed to utf-8. 


Changing the default encoding at the wrong time and in the incorrect way can 
prevent you from performing tasks such as importing modules. Make sure to test 
your code carefully and completely to ensure that any change in the default encod- 
ing won’t affect your ability to run the application. Good additional articles to read 
on this topic appear at http://blog.notdot.net/2010/07/Getting-—unicode- 
right-in-Python and http: //web.archive.org/web/2@1 207221 70929/http: // 
boodebr .org/main/python/al 1-about—python—and-unicode. 
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The previous NLP discussions in this chapter show how a machine learning algo- 
rithm can read text (after scraping it from the web) using the BoW representation 
and how NLP can enhance its understanding of text using text length normaliza- 
tion, TF-IDF model, and n-grams. The following sections demonstrate how to put 
text processing into use by learning to solve two common problems in textual 
analysis: classification and sentiment analysis. 


Performing classification tasks 


When you classify texts, you assign a document to a class because of the topics it 
discusses. You can discover the topics in a document in different ways. The sim- 
plest approach is prompted by the idea that if a group of people talks or writes 
about a topic, the people tend to use words from a limited vocabulary because they 
refer or relate to the same topic. When you share some meaning or are part of the 
same group, you tend to use the same language. Consequently, if you have a col- 
lection of texts and don’t know what topics the text references, you can reverse 
the previous reasoning — you can simply look for groups of words that tend to 
associate, so their newly formed group by dimensionality reduction may hint at 
the topics you’d like to know about. This is a typical unsupervised learning task. 


This learning task is a perfect application for the singular value decomposition 
(SVD) family of algorithms discussed in Chapter 13, because by reducing the num- 
ber of columns, the features (which, in a document, are the words) will gather in 
dimensions, and you can discover the topics by checking high-scoring words. SVD 
and Principal Components Analysis (PCA) provide features to relate both posi- 
tively and negatively to the newly created dimensions. So a resulting topic may be 
expressed by the presence of a word (high positive value) or by the absence of it 
(high negative value), making interpretation both tricky and counterintuitive for 
humans. The Scikit-learn package includes the Non-Negative Matrix Factoriza- 
tion (NMF) decomposition class (encountered in Chapter 19), which allows an 
original feature to relate only positively with the resulting dimensions. 


This example starts with a new experiment after loading the 20newsgroups data- 
set, a dataset collecting newsgroup postings scraped from the web, selecting only 
the posts regarding objects for sale and automatically removing headers, footers, 
and quotes. You may receive a warning message to the effect of, WARNING: sklearn. 
datasets. twenty_newsgroups:Downloading dataset from ..., with the URL of 
the site used for the download when working with this code. 


import warnings 
warnings. filterwarnings("ignore" ) 
from sklearn.datasets import fetch_2Qnewsgroups 


PART 5 Applying Learning to Real Problems 


REMEMBER 


dataset = fetch_2Q@newsgroups(shuffle=True, 
categories = ['misc.forsale'], 
remove=('headers', 'footers', 'quotes'), random_state=1@1 ) 
print ('Posts: %i' % len(dataset.data) ) 


Posts: 585 


The TfidVectorizer class is imported and set up to remove stop words (common 
words such as the or and) and keep only distinctive words, producing a matrix 
whose columns point to distinct words. 


from sklearn. feature_extraction.text import TfidfVectorizer 
vectorizer = TfidfVectorizer(max_df=0.95, 
min_df=2, stop_words='english' ) 
tfidf = vectorizer.fit_transform(dataset.data) 
from sklearn.decomposition import NMF 
n_topics = 5 
nmf = NMF(n_components=n_topics, random_state=101).fit(tfidf) 


As noted earlier in the chapter, term frequency-inverse document frequency 
(TF-IDF) is a simple calculation based on the frequency of a word in the document. 
It is weighted by the rarity of the word between all the documents available. 
Weighting words is an effective way to rule out words that cannot help you to 
classify or identify the document when processing text. For example, you can 
eliminate common parts of speech or other common words. 


As with other algorithms from the sklearn.decomposition module, the n_ 
components parameter indicates the number of desired components. If you’d like 
to look for more topics, you use a higher number. As the required number of topics 
increases, the reconstruction_err_ method reports lower error rates. It’s up to 
you to decide when to stop given the trade-off between more time spent on com- 
putations and more topics. 


The last part of the script outputs the resulting five topics. By reading the printed 
words, you can decide on the meaning of the extracted topics, thanks to product 
characteristics (for instance, the words drive, hard, card, and floppy refer to com- 
puters) or the exact product (for instance, comics, car, stereo, games). 


feature_names = vectorizer.get_feature_names() 
n_top_words = 15 
for topic_idx, topic in enumerate(nmf.components_): 
print ("Topic #%d:" % (topic_idx+),) 
print (" ".join([feature_names[i] for i in 


topic.argsort()[:-n_top_words - 1:-1]])) 
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Topic #1: 
drive hard card floppy monitor meg ram disk motherboard vga scsi 


brand 
color internal modem 
Topic #2: 
00 50 dos 20 10 15 cover ist new 25 price man 40 shipping comics 
Topic #3: 
condition excellent offer asking best car old sale good new miles 
10 000 
tape cd 
Topic #4: 
email looking games game mail interested send like thanks price 
package 
list sale want know 
Topic #9: 


shipping vcr stereo works obo included amp plus great volume vhs 
unc mathes 
gibbs radley 


You can explore the resulting model by looking into the attribute components_ 
from the trained NMF model. It consists of a NumPy ndarray holding positive val- 
ues for words connected to the topic. By using the argsort method, you can get 
the indexes of the top associations, whose high values indicate that they are the 
most representative words. 


print (nmf.components_[@,:].argsort() [:—-n_top_words-1 :-1] ) 
# Gets top words for topic @ 


[1337 1749 889 1572 2342 2263 2803 1290 2353 3615 3017 806 
1022 1938 
2334] 
Decoding the words’ indexes creates readable strings by calling them from the 
array derived from the get_feature_names method applied to the TfidfVectorizer 


that was previously fitted. 


print (vectorizer.get_feature_names() [1337] ) 
# Transforms index 1337 back to text 


drive 
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Analyzing reviews from e-commerce 


Sentiment is difficult to catch because humans use the same words to express 
even opposite sentiments. The expression you convey is a matter of how you con- 
struct your thoughts in a phrase, not simply the words used. Even though diction- 
aries of positive and negative words do exist and are helpful, they aren’t decisive 
because word context matters. You can use these dictionaries as a way to enrich 
textual features, but you have to rely more on machine learning if you want to 
achieve good results. 


It’s a good idea to see how positive and negative word dictionaries work. The 
AFINN-111 dictionary contains 2,477 positive and negative words and phrases 
(http: //www2. imm.dtu.dk/pubdb/views/publication_details.php?id=601@). 
Another good choice is the larger opinion lexicon by Hu and Liu that appears at 
https: //www.cs.uic.edu/~liub/FBS/sentiment-—analysis.html#lexicon. 
Both dictionaries contain English words. 


Using labeled examples that associate phrases to sentiments can create more 
effective predictors. In this example, you create a machine learning model based 
on a dataset containing reviews from Amazon, Yelp, and IMDB that you can find 
at the UCI, the machine learning repository, https: //archive.ics.uci.edu/ml/ 
datasets/Sentiment+Labelled+Sentences. 


This dataset was created for the paper “From Group to Individual Labels Using 
Deep Features,” by Kotzias et al., for KDD 2015. The dataset contains 3,000 labeled 
reviews equally divided from the three sources, and the data has a simple struc- 
ture. Some text is separated by a tab from a binary sentiment label where 1 is a 
positive sentiment and 0 a negative one. You can download the dataset and place 
it in your Python working directory using the following commands: 


try: 

import urllib2 # Python 2.7.x 
except: 

import urllib.request as urllib2 Python 3.x 
import requests, io, os, zipfile 














UCI_url = 'https://archive.ics.uci.edu/ml/\ 
machine-learning—databases/00331 /sentiment%20\ 
labelled%2@sentences.zip' 


response = requests.get(UCI_ur1) 
compressed_file = io.BytesIO(response.content ) 
z = zipfile.ZipFile(compressed_file) 

print ('Extracting in %s' % os.getcwd()) 

for name in z.namelist(): 
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filename = name.split('/')[-1] 
nameOK = ('MACOSX' not in name and '.DS' not in name) 
if filename and nameOK: 
newfile = os.path. join(os.getcwd(), 
os.path.basename( filename) ) 
with open(newfile, 'wb') as f: 
f.write(z.read(name) ) 
print ('\tunzipping %s' % newfile) 


In case the previous script doesn’t work, you can download the data (in zip 
format) directly from https://archive.ics.uci.edu/ml/machine-learning- 
databases/00331/ and expand it using your favorite unzipper. You’ll find the 
imdb_labelled.txt file inside the newly created sentiment labelled sentences 
directory. After downloading the files, you can upload the IMDB file to a pandas 
DataFrame by using the read_csv function. 


import numpy as np 

import pandas as pd 

dataset = 'imdb_labelled.txt' 

data = pd.read_csv(dataset, header=None, sep=r"\t", 
engine='python' ) 

data.columns = ['review', 'sentiment' ] 


Exploring the textual data is quite interesting. You’ll find all short phrases such as 
“Wasted two hours” or “It was so cool.” Some are clearly ambiguous for a com- 
puter, such as “Waste your money on this game.” Even though waste has a nega- 
tive meaning, the imperative makes the phrase sound positive. A machine learning 
algorithm can learn to decipher ambiguous phrases like these only after seeing 
many variants. The next step is to build the model by splitting the data into train- 
ing and test sets. 


from sklearn.cross_validation import train_test_sp1lit 
corpus, test_corpus, y, yt = train_test_split( 
data.ix[:,0], data.ix[:,1], 
test_size=0.25, random_state=101 ) 


After splitting the data, the code transforms the text using most of the NLP tech- 
niques described in this chapter: token counts, unigrams and bigrams, stop words 
removal, text length normalization, and TF-IDF transformation. 


from sklearn.feature_extraction import text 
vectorizer = text.CountVectorizer(ngram_range=(1,2), 

stop_words='english').fit(corpus) 
TfidF = text.TfidfTransformer() 
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X = TfidF.fit_transform(vectorizer.transform(corpus) ) 
Xt = TfidF.transform(vectorizer.transform(test_corpus) ) 


After the text for both the training and test sets is ready, the algorithm can learn 
sentiment using a linear support vector machine. This kind of support vector 
machine supports L2 regularization, so the code must search for the best C param- 
eter using the grid search approach. 


from sklearn.svm import LinearSVC 

from sklearn.grid_search import GridSearchCV 

param_grid = {'C': [0.0@1, @.1, 1.0, 10.0, 100.0]} 

clf = GridSearchCV(LinearSVC(loss='hinge', 
random_state=101), param_grid) 

Gilic S Gli, ae V) 

print ("Best parameters: %s" % clf.best_params_) 


Best parameters: {'C': 1.0} 


Now that the code has determined the best hyper-parameter for the problem, you 
can test performance on the test set using the accuracy measure, the percentage of 
correct times the code can guess the correct sentiment. 


from sklearn.metrics import accuracy_score 

solution = clf.predict(Xt) 

print("Achieved accuracy: %0.3f" % 
accuracy_score(yt, solution) ) 


Achieved accuracy: 0.816 
The results indicate accuracy of higher than 80 percent, but determining which 
phrases tricked the algorithm into making a wrong prediction is interesting. You 
can print the misclassified texts and consider what the learning algorithm is 


missing in terms of learning from text. 


print(test_corpus|[yt!=solution] ) 


601 There is simply no excuse for something this p... 
32 This is the kind of money that is wasted prope... 
887 At any rate this film stinks, its not funny, a... 
668 Speaking of the music, it is unbearably predic... 
408 It really created a unique feeling though. 

413 The camera really likes her in this movie. 

138 I saw "Mirrormask" last night and it was an un... 
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This was a poor remake of "My Best Friends Wed... 


Rating: 1 out of 10. 


I'm so sorry but I really can't recommend it t... 
A world better than 95% of the garbage in the ... 
But I recommend waiting for their future effor... 
The film deserves strong kudos for taking this... 


I don't think you will be disappointed. 
It is shameful. 


This movie now joins Revenge of the Boogeyman ... 
You share General Loewenhielm's exquisite joy ... 
It's this pandering to the audience that sabot... 
Still, I do like this movie for it's empowerme... 


Of course, the acting is blah. 
Waste your money on this game. 


The only place good for this film is in the ga... 


My only problem is I thought the actor playing. 
Go watch it! 
This movie is also revealing. 


I love Lane, but I've never seen her in a movi... 
Tom Wilkinson broke my heart at the end... and... 
There are massive levels, massive unlockable c... 


It is not good. 


I struggle to find anything bad to say about i... 


What on earth is Irons doing in this film? 
Highly unrecommended. 


A mature, subtle script that suggests and occa... 
Considering the relations off screen between T... 
Easily, none other cartoon made me laugh ina... 


A bit predictable. 


I like Armand Assante & my cable company's sum... 


I won't say any more - I don't like spoilers, 


Im big fan of RPG games too, but this movie, i... 
This would not even be good as a made for TV f... 
At no point in the proceedings does it look re... 
And, FINALLY, after all that, we get to an end... 


Too politically correct. 


Rating: @/1@ (Grade: Z) Note: The Show Is So B... 


This film has no redeeming features. 


This movie creates its own universe, and is fa... 


Name: review, dtype: object 
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IN THIS CHAPTER 


Understanding why the revolution in 
recommending products is important 


Obtaining rating data 


Using SVD when working with products 
and movies 


Recommending 
Products and Movies 


ne of the oldest and most common sales techniques is to recommend 

something to a customer based on what you know about the customer’s 

needs and wants. If people buy one product, they might buy another asso- 
ciated product if given a good reason to do so. They may not even have thought 
about the need for the second product until the salesperson recommends it, yet 
they really do need to have it in order to use the primary product. For this reason 
alone, most people actually like to get recommendations. Given that web pages 
now serve as a Salesperson in many cases, recommender systems are a necessary 
part of any serious sales effort on the web. This chapter helps you better under- 
stand the significance of the recommender revolution in all sorts of venues. 


Recommender systems serve all sorts of other needs. For example, you might see 
an interesting movie title, read the synopsis, and still not know whether you’re 
likely to find it a good movie. Watching the trailer might prove equally fruitless. 
Only after you see the reviews provided by others do you feel you have enough 
information to make a good decision. You also find methods for obtaining and 
using rating data in this chapter. 


Gathering, organizing, and ranking such information is hard, though, and infor- 
mation overflow is the bane of the Internet. A recommender system can perform 
all the required work for you in the background, making the work of getting to a 
decision a lot easier. You may not even realize that search engines are actually 
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huge recommender systems. The Google search engine, for instance, can provide 
personalized search results based on your previous search history. 


Recommender systems do more than just make recommendations. After reading 
images and texts, machine learning algorithms can also read a person’s personal- 
ity, preferences, and needs and act accordingly. This chapter helps you understand 
how all these activities take place by exploring techniques such as singular value 
decomposition (SVD). 


Realizing the Revolution 


370 


A recommender system can suggest items or actions of interest to a user, after 
having learned the user’s preferences over time. The technology, which is based 
on data and machine learning techniques (both supervised and unsupervised), has 
appeared on the Internet for about two decades. Today you can find recommender 
systems almost everywhere, and they’re likely to play an even larger role in the 
future under the guise of personal assistants, such as Siri or some other artificial- 
intelligence—based digital assistant. 


The drivers for users and companies to adopt recommender systems are different 
but complementary. Users have a strong motivation to reduce the complexity of 
the modern world (regardless of whether the issue is finding the right product or 
a place to eat) and avoid information overload. Companies, on the other hand, find 
that recommender systems provide a practical way to communicate in a personal- 
ized way with their customers and successfully push sales. 


Recommender systems actually started as a means to handle information over- 
load. The Xerox Palo Alto Research Center built the first recommender in 1992. 
Named Tapestry, it handled the increasing number of emails received by center 
researchers. The idea of collaborative filtering was born (learned from users by 
leveraging similarities in preferences), and the GroupLens project soon extended 
it to news selection and movie recommendations (the MovieLens project, whose 
data you use in this chapter). 


When giant players in the e-commerce sector, such as Amazon, started adopting 
recommender systems, the idea went mainstream and spread widely in e-commerce. 
Netflix did the rest by promoting recommenders as a business tool and sponsoring a 
competition to improve its recommender system (https: //en.wikipedia.org/ 
wiki/Netflix_Prize) that involved various teams for quite a long time. The result 
is an innovative recommender technology that uses SVD and Restricted Boltzmann 
Machines (a kind of unsupervised neural network discussed in Chapter 16). 


However, recommender systems aren’t limited to promoting products. Since 
2002, a new kind of Internet service has made its appearance: social networks 
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such as Friendster, Myspace, Facebook, and LinkedIn. These services promote link 
exchanges between users and share information such as posts, pictures, and vid- 
eos. In addition, search engines such as Google amassed user response informa- 
tion to offer more personalized services and understand how to match users’ 
desires when responding to users’ queries better (https: //en.wikipedia.org/ 
wiki/RankBrain). 


Recommendations have become so pervasive in guiding people’s daily life that 
experts now worry about the impact on our ability to assume independent deci- 
sions and perceive the world in freedom. You can read about this concern in the 
article at https: //en.wikipedia.org/wiki/Filter_bubble. The history of rec- 
ommender systems is one of machines striving to learn about our minds and 
hearts, to make our lives easier, and to promote the business of their creators. 


Downloading Rating Data 


Getting good rating data can be hard. Later in this chapter, you use the MovieLens 
dataset to see how SVD can help you in creating movie recommendations. How- 
ever, you have other databases at your disposal. The following sections describe 
the MovieLens dataset and the data logs contained in MSWeb — both of which 
work quite well when experimenting with recommender systems. 


Trudging through the MovieLens dataset 


The MovieLens site (https: //movielens.org/) is all about helping you find a 
movie you might like. After all, with millions of movies out there, finding some- 
thing new and interesting could take time that you don’t want to spend. The setup 
works by asking you to input ratings for movies that you already know about. The 
MovieLens site then makes recommendations for you based on your ratings. In 
short, your ratings teach an algorithm what to look for, and then the site applies 
this algorithm to the entire dataset. 


You can obtain the MovieLens dataset at http://grouplens.org/datasets/ 
movielens/. The interesting thing about this site is that you can download all or 
part of the dataset based on how you want to interact with it. You can find down- 
loads in the following sizes: 


>> 100,000 ratings from 1,000 users on 1,700 movies 
>> 1 million ratings from 6,000 users on 4,000 movies 


>» 10 million ratings and 100,000 tag applications applied to 10,000 movies by 
72,000 users 
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>> 20 million ratings and 465,000 tag applications applied to 27,000 movies by 
138,000 users 


>> MovieLens's latest dataset in small or full sizes (At this writing, the full size 
contained 21,000,000 ratings and 470,000 tag applications applied to 27,000 
movies by 230,000 users; its size increase.) 


This dataset presents you with an opportunity to work with user-generated data 
using both supervised and unsupervised techniques. The large datasets present spe- 
cial challenges that only big data can provide. You can find some starter information 
for working with supervised and unsupervised techniques in Chapters 12 and 14. 


The following example uses a version called the MovieLense ratings dataset found 
in the R recommenderlab library. After calling the library from R (and installing it 
if it isn’t already available on your system), the code uploads the library into 
memory and starts exploring the data. 


if (!"recommenderlab" %in% rownames( installed. packages() ) ) 
{install .packages("recommenderlab" ) } 

library("recommender lab" ) 

data(MovieLense ) 

print(MovieLense) 


943 x 1664 rating matrix of class ‘realRatingMatrix’ with 
99392 ratings. 


Printing the dataset doesn’t print any data but reports to you that the dataset is a 
matrix of 943 rows (the users) and 1664 columns (the movies), containing 99392 
ratings. MovieLense is actually a sparse matrix, a matrix that compresses the data 
by removing most of the zero values. You can normally operate on a sparse matrix 
as you would when using a standard matrix. When necessary, you can convert 
it into a standard dense matrix for specific statistics by using code like the 
following: 


print(table(as.vector(as(MovieLense, "matrix")))) 


1 2 3 4 5 
6059 11307 27002 33947 21077 


The output displays the rankings distribution. Rankings range from 1 to 5, and 
there are more positive rankings than negative ones. This often happens with rat- 
ing data: It has some imbalance in favor of positive data because users tend to buy 
or watch what they believe they will like. Disappointment mostly motivates nega- 
tive ratings because expectations aren’t satisfied. You can also report how many 
films each user has rated on average and how many users have rated each film: 
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summary (colCounts(MovieLense) ) 
Min. ist Qu. Median Mean 3rd Qu. Max. 
1.00 7.00 27.00 59.73 80.00 583.00 
summary(rowCounts(MovieLense)) 
Min. ist Qu. Median Mean 3rd Qu. Max. 
19.0 32.0 64.0 105.4 147.5 735.0 


It is also quite easy to go into deeper detail and find how users rank a particular 
film. 


average_ratings <- colMeans(MovieLense) 


print(average_ratings[50]) 
Star Wars (1977) 
4.358491 


print (colCounts(MovieLense[,5Q] ) ) 
Star Wars (1977) 
583 


In this example, 583 users have rated the fiftieth movie, the original Start Wars 
from 1977, which scored an average rating of 4.36. 


Navigating through anonymous web data 


Another interesting dataset that you can use to learn from preferences is the 
MSWeb dataset. It consists of a week’s worth of anonymously recorded data from 
the Microsoft website. In this case, the recorded information is about a behavior, 
not a judgment, thus values are expressed in a binary form. As with the MovieLens 
dataset, you can download the MSWeb dataset from the R recommenderlab library, 
get information about its structure, and explore how its values are distributed. 


data(MSWeb ) 

print (MSWeb ) 

32710 x 285 rating matrix of class ‘binaryRatingMatrix’ 
with 98653 ratings. 


print(table(as.vector(as(MSWeb, "matrix")))) 


FALSE TRUE 
9223697 98653 


The dataset, stored in a sparse matrix, consists of 32710 randomly selected Micro- 
soft website users, and the columns represent 285 Vroots. A Vroot is a series of 


grouped website pages. Together they constitute an area of the website. The binary 
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values show whether someone has visited a certain area. (You just see a flag; you 
don’t see how many times the user has actually visited that website area.) 


The idea is that a user’s visit to a certain area indicates a specific interest. For 
instance, when a user visits pages to learn about productivity software along with 
visits to a page containing terms and prices, this behavior indicates an interest in 
acquiring the productivity software soon. Useful recommendations can be based 
on such inferences about a user’s desire to buy certain versions of the productivity 
software or bundles of different software and services. 


The remainder of the chapter uses the MovieLens dataset exclusively. However, 

you should use the knowledge gained in this chapter to explore the MSWeb dataset 

using the same methodologies because they apply equally to rating data and 
TIP binary data. 


Encountering the limits of rating data 


For recommender systems to work well, they need to know about you as well as 
other people, both like you and different from you. Acquiring rating data allows a 
recommender system to learn from the experiences of multiple customers. Rating 
data could derive from a judgment (such as rating a product using stars or num- 
bers) or a fact (a binary 1/0 that simply states that you bought the product, saw a 
movie, or stopped browsing at a certain web page). 


No matter the data source or type, rating data is always about behaviors. To rate a 

movie, you have to decide to see it, watch it, and then rate it based on your experi- 

ence of seeing the movie. Actual recommender systems learn from rating data in 
REMEMBER different ways: 


>» Collaborative filtering: Matches raters based on movie or product similari- 
ties used in the past. You can get recommendations based on items liked by 
people similar to you or on items similar to those you like. 


>» Content-based filtering: Goes beyond the fact that you watched a movie. It 
examines the features relative to you and the movie to determine whether a 
match exists based on the larger categories that the features represent. For 
instance, if you are a female who likes action movies, the recommender will 
look for suggestions that include the intersection of these two categories. 


>> Knowledge based recommendations: Based on meta-data, such as 
preferences expressed by users and product descriptions. It relies on machine 
learning and is effective when you do not have enough behavioral data to 
determine user or product characteristics. This is called a cold start and 
represents one of the most difficult recommender tasks because you don't 
have access to either collaborative filtering or content-based filtering. 
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When using collaborative filtering, you need to calculate similarity (see Chapter 14 
for a discussion of the use of similarity measures). Apart from Euclidean, Manhat- 
tan, and Chebyshev distances, the remainder of this section discusses cosine 
similarity. Cosine similarity measures the angular cosine distance between two 
vectors, which may seem like a difficult concept to grasp but is just a way to 
measure angles in data spaces. 


Imagine a space made of features and having two points. Using the formulations 
found in Chapter 14, you can measure the distance between the points. For 
instance, you could use the Euclidean distance, which is a perfect choice when you 
have few dimensions, but which fails miserably when you have multiple dimen- 
sions because of the curse of dimensionality (https: //en.wikipedia.org/wiki/ 
Curse_of_dimensionality). 


The idea behind the cosine distance is to use the angle created by the two points 
connected to the space origin (the point where all dimensions are zero) instead. 
If the points are near, the angle is narrow, no matter how many dimensions are 
there. If they are far away, the angle is quite large. Cosine similarity implements 
the cosine distance as a percentage and is quite effective in telling whether a user 
is similar to another or whether a film can be associated to another because the 
same users favor it. The following example locates the movies that are the most 
similar movies to movie 50, Star Wars. 


print (colnames(MovieLense[ ,5@] )) 
[1] "Star Wars (1977)" 


similar_movies <- similarity(MovieLense|[ ,50], 
MovieLense|[ ,-5@], 
method ="cosine", 
which = "items") 

colnames(similar_movies) [which(similar_movies>@.7@) ] 

[1] "Toy Story (1995)" 

"Empire Strikes Back, The (1980)" 

[3] "Raiders of the Lost Ark (1981)" 

"Return of the Jedi (1983)" 


Leveraging SVD 


A property of SVD is to compress the original data at such a level and in such a 
smart way that, in certain situations, the technique can actually create new mean- 
ingful and useful features, not just compressed variables. The following sections 
help you understand what role SVD plays in recommender systems. 
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Considering the origins of SVD 


SVD is a method from linear algebra that can decompose an initial matrix into the 
multiplication of three derived matrices. The three derived matrices contain the 
same information as the initial matrix, but in a way that expresses any redundant 
information (expressed by statistical variance) only once. The benefit of the new 
variable set is that the variables have an orderly arrangement according to the 
initial variance portion contained in the original matrix. 


SVD builds the new features using a weighted summation of the initial features. It 
places features with the most variance leftmost in the new matrix, whereas fea- 
tures with the least or no variance appear on the right side. As a result, no correla- 
tion exists between the features. (Correlation between features is an indicator of 
information redundancy, as explained in the previous paragraph.) Here’s the for- 
mulation of SVD: 


A=U*#D#V" 


For compression purposes, you need to know only about matrices U and D, but 
examining the role of each resulting matrix helps, starting with the origin. Ais a 
matrix n*p, where n is the number of examples and p is the number of variables. 
As an example, consider a matrix containing the purchase history of n customers, 
who bought something in the p range of available products. The matrix values are 
populated with quantities that customers purchased. As another example, imag- 
ine a matrix in which rows are individuals, columns are movies, and the content 
of the matrix is a movie rating (which is exactly what the MovieLens dataset 
contains). 


After the SVD computation completes, you obtain the U, S, and V matrices. Uisa 
matrix of dimensions n by k, where k is p, exactly the same dimensions of the 
original matrix. It contains the information about the original rows on a recon- 
structed set of columns. Therefore, if the first row on the original matrix is a vec- 
tor of items that Mr. Smith bought, the first row of the reconstructed U matrix will 
still represent Mr. Smith, but the vector will have different values. The new U 
matrix values are a weighted combination of the values in the original columns. 


Of course, you might wonder how the algorithm creates these combinations. The 
combinations are devised to concentrate the most variance possible on the first 
column. The algorithm then concentrates most of the residual variance in the sec- 
ond column with the constraint that the second column is uncorrelated with the 
first one, distributing the decreasing residual variance to each column in succes- 
sion. By concentrating the variance in specific columns, the original features that 
were correlated are summed into the same columns of the new U matrix, thus 
cancelling any previous redundancy present. As a result, the new columns in U 
don’t have any correlation between themselves, and SVD distributes all the original 
information in unique, nonredundant features. Moreover, given that correlations 
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may indicate causality (but correlation isn’t causation; it can simply hint at it — a 
necessary but not sufficient condition), cumulating the same variance creates a 
rough estimate of the variance’s root cause. 


V is the same as the U matrix, except that its shape is p*k and it expresses the 
original features with new cases as a combination of the original examples. This 
means that you’ll find new examples composed of customers with the same buying 
habits. For instance, SVD compresses people buying certain products into a single 
case that you can interpret as a homogeneous group or as an archetypal customer. 


In such reconstruction, D, a diagonal matrix (only the diagonal has values) con- 
tains information about the amount of variance computed and stored in each new 
feature in the U and V matrices. By cumulating the values along the matrix and 
making a ratio with the sum of all the diagonal values, you can see that the vari- 
ance is concentrated on the first leftmost features, while the rightmost are almost 
zero or an insignificant value. Therefore, an original matrix with 100 features can 
be decomposed and have an S matrix whose first 10 new reconstructed features 
represent more than 90 percent of the original variance. 


SVD has many optimizing variants with slightly different objectives. The core 
functions of these algorithms are similar to SVD. Principal component analysis 
(PCA) focuses on common variance. It’s the most popular algorithm and is used in 
machine learning preprocessing applications. (You can learn more about PCA in 
Chapter 13.) 


A great SVD property is that the technique can create new meaningful and useful 
features, not just compressed variables, as a byproduct of compression in certain 
situations. In this sense, you can consider SVD a feature creation technique. 


Understanding the SVD connection 


If your data contains hints and clues about a hidden cause or motif, an SVD can put 
them together and offer you proper answers and insights. That is especially true 
when your data is made up of interesting pieces of information like the ones in the 
following list: 


>> Text in documents hints at ideas and meaningful categories: Just as you 
can make up your mind about discussion topics by reading blogs and news- 
groups, so can SVD help you deduce a meaningful classification of groups of 
documents or the specific topics being written about in each of them. 


>» Reviews of specific movies or books hint at your personal preferences 
and larger product categories: If you say on a rating site that you loved the 
original Star Trek series collection, the algorithm can easily determine what 
you like in terms of other films, consumer products, or even personality types. 


CHAPTER 21 Recommending Products and Movies 377 


378 


An example of a method based on SVD is latent semantic indexing (LSI), which 
has been successfully used to associate documents and words based on the idea 
that words, though different, tend to have the same meaning when placed in sim- 
ilar contexts. This type of analysis suggests not only synonymous words but also 
higher grouping concepts. For example, an LSI analysis on some sample sports 
news may group baseball teams of the Major League based solely on the 
co-occurrence of team names in similar articles, without any previous knowledge 
of what a baseball team or the Major League is. 


Other interesting applications for data reduction are systems for generating rec- 
ommendations about the things you may like to buy or know more about. You 
likely have quite a few occasions to see recommenders in action. On most 
e-commerce websites, after logging in, visiting some product pages, and rating or 
putting a product into your electronic basket, you see other buying opportunities 
based on other customers’ previous experiences. (As mentioned previously, this 
method is called collaborative filtering.) SVD can implement collaborative filtering 
in a more robust way, relying not just on information from single products but 
also on the wider information of a product in a set. For example, collaborative 
filtering can determine not only that you liked the film Raiders of the Lost Ark but 
also that you generally like all action and adventure movies. 


You can implement collaborative recommendations based on simple means or 
frequencies calculated on other customers’ set of purchased items or on ratings 
using SVD. This approach helps you reliably generate recommendations even in 
the case of products that the vendor seldom sells or that are quite new to users. 


Seeing SVD in action 


For the example in this section, you use the MovieLense dataset described in the 
“Trudging through the MovieLens dataset” section, earlier in the chapter. After 
uploading it, you choose settings to work with users and movies with a minimum 
number of available ratings: 


ratings_movies <- MovieLense[rowCounts(MovieLense) > 10, 
colCounts(MovieLense) > 50] 


After you filter the useful profiles, you center the ratings of each user. You sub- 
tract the mean from the rating values for each user to center the rating. This 
operation lessens the effect of extreme ratings (giving only highest or lowest 
ratings). It also makes assigning missing values easy because each user can rate 
only a few movies. The missing ratings, which SVD requires to complete its com- 
putations, are determined using an average, which means zero values after mean 
centering. 
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ratings_movies_norm <- normalize(ratings_movies, row=TRUE) 
densematrix <- as(ratings_movies_norm, "matrix" ) 
densematrix[is.na(densematrix)] <- @ 


After normalizing, setting the missing ratings to zero, and making the matrix 
dense (you’re not using a sparse matrix anymore), the code uploads the irlba 
library (https://cran.r-project.org/web/packages/irlba/index.html). If 
the library isn’t present on your system, you can use the following code snippet to 
install it: 


if (!"irlba" %in% rownames(installed.packages()) ) 
{install.packages("irlba") } 

library("irlba") 

SVD <- irlba(densematrix, nv = 50, nu = 50) 


The library’s core algorithm is the augmented implicitly restarted lanczos bi- 
diagonalization algorithm (IRLBA), which computes an approximate SVD limited 
to a certain number of reconstructed dimensions. By computing just the required 
dimensions, it saves times and enables you to apply SVD even to immense matri- 
ces. The Netflix Prize successfully used the algorithm, and it works best when 
computing few SVD dimensions (https: //www.youtube.com/watch?feature= 
player_embedded&v=ipkuRgYT8_I). 


The following code explores the matrices extracted by the ir1ba function and uses 
them as a smaller dataset with similar informative content to the original sparse 
matrix. Notice how the matrix u has the same number of rows as the initial movie 
matrix, while the matrix v has the number of rows equal to the number of col- 
umns of the original matrix. Columns are always 50, the number of dimensions 
requested as part of the irlba call. 


print(attributes(SVD) ) 


$names 

[1] "g" "u" "y" "iter" "mprod" 
print(dim(densematrix) ) 

[1] 943 594 

print(dim(SVD$u) ) 

[1] 943 50 

print(dim(SVD$v) ) 

[1] 591 50 

print(length(SVD$d) ) 

[1] 50 
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The example doesn’t stop at learning from data using an unsupervised approach. 
It can also learn the likelihood of a user’s having seen a certain film from the data, 
that is, you can determine whether a person has seen a film or not based on movie 
interests, thanks to the SVD reconstruction. To make this analysis possible, the 
code selects a film, extracts it from the dataset, and recomputes the SVD decom- 
position. This way, the output doesn’t have any hints about the specific film inside 
the reconstructed matrices. 


chosen_movie <- 45 
print (paste("Choosen film:", 
colnames(densematrix) [chosen_movie] ) ) 
answer <- as. factor(as.numeric( 
densematrix[,chosen_movie] !=@) ) 
SVD <- irlba(densematrix[,-chosen_movie], nv=5@, nu=5@) 
rotation <- data. frame(movies=colnames( 
densematrix[,—-chosen_movie] ),SVD$v) 


[1] "Choosen film: Pulp Fiction (1994)" 


Before proceeding to learn from data, the example takes advantage of the v matrix 
produced from the SVD. The v matrix, the item matrix, contains information about 
the films, and it tells how SVD calculates the features in the matrix u (user matrix). 
For learning tasks, the example uses the matrix u and its reconstructed 50 com- 
ponents. As a machine learning tool, the example relies on a Random Forest 
ensemble of decision tree models. 


if (!"randomForest" %in% rownames(installed.packages()) ) 
{install .packages("randomForest" ) } 

library("randomForest" ) 

train <- sample(1:length(answer ) , 500) 

user_matrix <- as.data. frame(SVD$u[train, ] ) 

target_matrix <- as.data.frame(SVD$u[-train, ] ) 

model <- randomForest(answer[train] ~., data=user_matrix, 

importance=TRUE ) 


To test the learned model effectively, the example uses 500 users as a training set. 
The example uses the remaining users to test the prediction accuracy. 


response <- predict(model, newdata=target_matrix, 
n.trees=model$n. trees) 
confusion_matrix <- table(answer[-train] ,response) 
precision <- confusion_matrix[2,2] / 
sum(confusion_matrix[,2] ) 
recall <- confusion_matrix[2,2] / 
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sum(confusion_matrix[2, ] ) 
print (confusion_matrix) 
print(paste("Precision:",round(precision,3), 
"Recall:",round(recall,3))) 


response 

© 1 
© 214 50 
1 36 143 


[1] "Precision: @.741 Recall: @.799" 


By arranging the predictions on the test set in a confusion matrix, you can observe 
that the precision is quite high, as is the recall. Precision is the percentage of cor- 
rect predictions. The Random Forest predicts that 210 users have seen the film in 
the test set and predicts a true assertion for 158 users of them, which amounts to 
a precision of 75.2 percent. Checking the true total of users who have seen the film 
(the previous figure was predicted) shows that 188 users watched the film. Because 
158 predicted users equates to 84 percent of the 188 users, the model has a recall 
of 84 percent, which is the capability of predicting all the positive users. 


The model that you prepared can also provide insight about reconstructed features 
that helps determine which users have seen a certain film. You can inspect the 
information by printing the importance derived from the Random Forest model, 
as shown in Figure 21-1. 


var ImpPlot (model ,n.var=10) 




















v1 o v1 o 
V3 o V3 o 
V22 o V6 o 
V7 ° V22 a 
V21 ° V4 o 
V6 o V7 o 
V12 o V21 o 
V29 o v5 o 
© o 
FIGURE 21-1: “4 xig 
Importance v5 jo v49 ° 
measures derived 
from Random T T T T T T T T T T T 
10 15 20 25 0 5 10 15 20 25 30 
Forest. MeanDecreaseAccuracy MeanDecreaseGini 


CHAPTER 21 Recommending Products and Movies 381 


382 


TIP 


REMEMBER 


According to the plotted importance, the first component of the u matrix derived 
from the SVD is the most predictive in determining whether a user has seen the 
movie. To determine what this component means, you can print its ordered ele- 
ments from the one contributing the most in a positive fashion to the one contrib- 
uting the most in a negative one. 


rotation[order(rotation[,2]),1:2] 


The resulting output is a long enumeration of films. At the start of the vector, in 
the negative range, you find films such as Star Wars, The Godfather, Raiders of the 
Lost Ark, and The Silence of the Lambs. At the end of the vector, in the positive sector, 
you see films such as Liar Liar, Mars Attacks!, and Broken Arrow. Because the nega- 
tive values are stronger in absolute value than the positive ones, these values seem 
to convey more sense to the SVD component. Thus, users who see blockbusters 
such as Star Wars, The Godfather, or Raiders of the Lost Ark are also likely to see a film 
like Pulp Fiction. You can test the theory directly by seeing how similar they are to 
the target film using the cosine distance. 


similarity(ratings_movies[,45], ratings_movies[,145], 
method ="cosine", which = "items") 


Raiders of the Lost Ark (1981) 
Pulp Fiction (1994) @.7374849 


similarity(ratings_movies[,45], ratings_movies|[,82], 
method ="cosine", which = "items") 


Silence of the Lambs, The (1991) 
Pulp Fiction (1994) ©. 7492093 


As shown in the output, users who see films such as Raiders of the Lost Ark or The 
Silence of the Lambs are also likely to see Pulp Fiction. 


Interpreting results from an SVD is quite an art and requires a lot of domain 
knowledge (in this case, that means having movie experts). SVD puts sets of users 
and items together in the u and v matrices. It’s up to you to use them for building 
recommendations and, if necessary, to provide them with an explanation about 
their reconstructed features based on your knowledge and intuition. 


SVD always finds the best way to relate a row or column in your data, discovering 
complex interactions or relations that you didn’t imagine before. You don’t need 
to imagine anything in advance; it’s a fully data-driven approach. 
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IN THIS PART... 





__ Discovering packages and libraries you need to add to 
your toolbox 


Learning about frameworks you can use to make 
things easier 


Getting tips for learning from data better 


Improving your use of models 


IN THIS CHAPTER 


Analyzing live-streaming data using 
Cloudera Oryx 


Recognizing objects in images using 
CUDA-Convnet 


Adding image recognition to web- 
based apps using ConvNetJS 


Obtaining an R SVM implementation 
using e1071 


Adding R GBM optimization support 
using gbm 


Performing natural language interface 
tasks using Gensim 


Creating a generalized linear model 
using gimnet 


Growing a forest of R decision trees 
using randomForest 


Performing scientific tasks of all sorts 
in Python using SciPy 


Obtaining GBDT, GBRT, and GBM support 
for a variety of languages with XGBoost 


Ten Machine Learning 
Packages to Master 


he book provides you with a wealth of information about specific machine 
learning packages such as caret (R) and NumPy (Python). Of course, these 
are good, versatile packages you can use to begin your machine learning 
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journey. It’s important to have more than a few tools in your toolbox, which is 
where the suggestions found in this chapter come into play. These packages pro- 
vide you with additional machine learning insights and capabilities. Even though 
there are many other packages available on the market, these packages will give 
you some ideas of where to go next and make it easier to explore other packages 
as well. 


Cloudera Oryx 


http: //www.cloudera.com/ 


Cloudera Oryx is a machine learning project for Apache Hadoop (http: //hadoop. 
apache.org/) that provides you with a basis for performing machine learning 
tasks. It emphasizes the use of live data streaming. This product helps you add 
security, governance, and management functionality that’s missing from Hadoop 
so that you can create enterprise-level applications with greater ease. 


The functionality provided by Oryx builds on Apache Kafka (http://kafka. 
apache.org/) and Apache Spark (http: //spark.apache.org/) (see the discus- 
sion of Spark in Chapter 8). Common tasks for this product are real-time spam 
filters and recommendation engines. You can download Oryx from https: // 
github.com/cloudera/oryx. 


CUDA-Convnet 
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https: //github.com/dnouri/cuda—convnet 


As described in Chapters 1 and 3, GPUs enable you to perform machine learning 
tasks significantly faster. Chapter 6 tells how you can add Accelerate to Anaconda 
to provide basic GPU support for that environment, and Chapter 8 tells about 
Caffe, a separate product that you can use to process images using Python or 
MATLAB. You also find GPUs mentioned in Chapter 12 concerning neural networks, 
and in Chapter 19 in the discussion about deep learning applications in image 
recognition. Even though the book doesn’t discuss how to use a GPU, and it just 
touches on the complex alchemy of a convolutional network, if you need to per- 
form serious image processing, you obviously need a GPU to do it. None of the 
libraries discussed in previous chapters provide as much of a speed boost as they 
could because they rely on generalized routines. The CUDA-Convnet library 
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provides specific support for NVidia’s CUDA (http: //www.nvidia.com/object/ 
cuda_home_new.htm1) GPU processor, which means that it can provide faster pro- 
cessing at the cost of platform flexibility (you must have a CUDA processor in your 
system). For the most part, this library sees use in neural-network applications. 


ConvNetjS 


e1071 


https: //github.com/karpathy/convnet js 


As described for CUDA-Convnet, being able to recognize objects in images is an 
important machine learning task, but getting the job done without a good library 
can prove difficult or impossible. While CUDA-Convnet provides support for 
heavy-duty desktop applications, ConvNetJS provides image-processing support 
for JavaScript applications. The important feature of this library is that it works 
asynchronously. When you make a call, the application continues to work. An 
asynchronous response lets the application know when tasks, such as training, 
complete so that the user doesn’t feel as if the browser has frozen (become unre- 
sponsive in some way). Given that these tasks can take a long time to complete, 
the asynchronous call support is essential. 


https: //cran.r-project.org/web/packages/e1071/index.htm1l 


You saw e1071 at work in Chapters 12 and 17. This R library, developed by the TU 
Wien E1071 group on probability theory, provides support for support vector 
machines (SVMs). Behind its R command interface runs an external C++ library 
(with a C API to interface with other languages) developed at the National Taiwan 
University. You can find more on LIBSVM for SVM classification and regression at 
http: //www.csie.ntu.edu.tw/-cjlin/libsvm/, together with plenty of data- 
sets, tutorials, and even a practical guide for getting more from SVMs. 


In addition, you get support functions for latent class analysis, short-time Fourier 
transform, fuzzy clustering, shortest-path computation, bagged clustering, and 
Naive Bayes classifiers. None of these uses of e1071 is directly demonstrated in the 
book, but you can find information on the basics for many of them, such as Naive 
Bayes (Chapters 9 and 12). 
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gbm 


https: //cran.r-project.org/web/packages/gbm/index.htm1 


As described in Chapter 18, the gradient boosting machines (GBM) algorithm uses 
gradient descent optimization to determine the right weights for learning in the 
ensemble. The resulting performance increase is impressive, making GBM one of 
the most powerful predictive tools that you can learn to use in machine learning. 
The gbm package adds GBM support to R. This package also includes regression 
methods for least squares, absolute loss, t-distribution loss, quantile regression, 
logistic, multinomial logistic, Poisson, Cox proportional hazards partial likeli- 
hood, AdaBoost exponential loss, Huberized hinge loss, and Learning to Rank 
measures (LambdaMart). 


The package also provides convenient functions to cross-validate and to find out 
how to tune without overfitting the number of trees, a crucial hyper-parameter of 
the algorithm. The book just barely touches on the incredible potential of this 
addition to your toolbox. 


Gensim 


https: //radimrehurek.com/gensim/ 


Gensim is a Python library that can perform natural language processing (NLP) 
and unsupervised learning on textual data. It offers a wide range of algorithms to 
choose from: TF-IDF, random projections, latent Dirichlet allocation, latent 
semantic analysis, and two semantic algorithms: word2vec and document2vec 
(https: //code.google.com/archive/p/word2vec/). 


Word2vec is based on neural networks (shallow, not deep learning, networks) and 
it allows meaningful transformations of words into vectors of coordinates that 
you can operate in a semantic way. For instance, operating on the vector repre- 
senting Paris, subtracting the vector France, and then adding the vector Italy 
results in the vector Rome, demonstrating how you can use mathematics and the 
right Word2vec model to operate semantic operations on text. 


gimnet 


https: //web.stanford.edu/~hastie/glmnet/glmnet_alpha. html 
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The “Guessing the Right Features” section of Chapter 15 discusses regularization 
as an effective, fast, and easy solution to use when you have many features and 
want to reduce the variance of the estimates due to multicollinearity between your 
predictors. One form of regularization discussed in that chapter is Lasso, which is 
one of the forms of support you get from glmnet (with the other being elastic- 
net). This package fits the linear, logistic and multinomial, Poisson, and Cox 
regression models. You can also use this package to perform prediction, plotting, 
and K-fold cross-validation. Professor Rob Tibshirani, the creator of the L1 (also 
known as Lasso) regularization also helped develop this package. The easiest way 
to get this package is by downloading it from https: //cran.r—-project.org/ 
web/packages/glmnet/index.html. In addition, Gensim provides multiprocess- 
ing and out-of-core capabilities, allowing you to speed up the processing of algo- 
rithms and handle textual data larger than available RAM memory. 


randomForest 


https: //cran.r-project.org/web/packages/randomForest/index.htm1 


As described in Chapter 18, you can improve a decision tree by replicating it many 
times and averaging results to get a more general solution. The R open source 
package for performing this task is randomForest. You can use it to perform clas- 
sification and regression tasks based on a forest of trees using random inputs. The 
Python version of this package appears as RandomForestClassifier and Random- 
ForestRegressor, both of which are found in Scikit-learn, a package used for most 
of the Python examples in the book. 


http: //www.scipy.org/ 


The SciPy stack contains a host of other libraries that you can also download sepa- 
rately. These libraries provide support for mathematics, science, and engineering. 
When you obtain SciPy, you get a set of libraries designed to work together to cre- 
ate applications of various sorts. These libraries are 


>> NumPy 
>> SciPy 
>> matplotlib 
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>> |Python 
>> Sympy 
>> pandas 


The SciPy library itself focuses on numerical routines, such as routines for numer- 
ical integration and optimization. SciPy is a general-purpose library that provides 
functionality for multiple problem domains. It also provides support for domain- 
specific libraries, such as Scikit-learn, Scikit-image, and statsmodels. To make 
your SciPy experience even better, try the resources at http://www.scipy- 
lectures .org/. The site contains many lectures and tutorials on SciPy’s functions. 


XGBoost 
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https: //github.com/dmlc/xgboost 


Even though the book specifically covers GBM in Chapter 18, other types of gradi- 
ent boosting machines exist that are based on a slightly different set of optimiza- 
tion approaches and cost functions. The XGBoost package enables you to apply 
GBM to any problem, thanks to its wide choice of objective functions and evalua- 
tion metrics. It operates with a variety of languages, including Python, R, Java, 
and C++. 


In spite of the fact that GBM is a sequential algorithm (and thus slower than oth- 
ers that can take advantage of modern multicore computers), XGBoost leverages 
multithread processing in order to search in parallel for the best splits among the 
features. The use of multithreading helps XGBoost turn in an unbeatable perfor- 
mance when compared to other GBM implementations, both in R and Python. 
Because of all that it contains, the full package name is eXtreme Gradient Boosting 
(or XGBoost for short). You can find complete documentation for this package at 
https: //xgboost.readthedocs.org/en/latest/. 
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IN THIS CHAPTER 


Understanding the best ways to 
improve your model 


Avoiding snooping and other self- 
deceptions 


Figuring out the best ways to optimize 
your problem 


Getting the right parameters for the 
best result 


Exploring solutions from simplest to 
most complex 


Putting different solutions together 
Using one solution to predict another 
Creating and engineering new features 


Setting less useful features and 
variables apart 


Offering algorithms more chances to 
learn 


Ten Ways to Improve 
Your Machine 
Learning Models 


ow that your algorithm has finished learning from the data obtained using 
Python or R, you’re pondering the results from your test set and wonder- 
ing whether you can improve them or have really reached the best possible 
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REMEMBER 


outcome. This chapter introduces you to a number of checks and actions that hint 
at methods you can use to improve machine learning performance and achieve a 
more general predictor that’s able to work equally well with your test set or new 
data. This list of ten techniques offers you opportunities to improve the outcome 
achieved using machine learning algorithms. 


Monitoring learning performance using a training set helps you keep an eye on 
how the algorithm is doing. Training results are always too optimistic because, as 
learning occurs, some data memorization happens as well. The following advice 
helps you achieve a better result when using a test set. Only a k-folds cross- 
validation estimate or out-of-sample results are predictive of how well your 
solution will work when using new data. 


Studying Learning Curves 
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As a first step to improving your results, you need to determine the problems with 
your model. Learning curves require you to verify against a test set as you vary the 
number of training instances. (A cross-validation estimate will suffice when you 
don’t have an out-of-sample set available.) You’ll immediately notice whether 
you find much difference between your in-sample and out-of-sample errors. A 
wide initial difference (low in-sample error and high out-of-sample error) is a 
sign of estimate variance; conversely, having errors that are both high and similar 
is a sign that you’re working with a biased model. 


You also need to understand how the model behaves when the sample size 
increases. When variance is the problem, the in-sample error should increase as 
the number of training examples grows because the model will find it harder to 
memorize all the new instances you are feeding to it. On the other hand, the out- 
of-sample error should decrease as the model learns more rules in light of the 
increased evidence provided by a larger sample. 


When bias is the problem, you should observe a pattern similar to the one described 
previously, but on a smaller scale. More data benefits a biased model only up to a 
certain number of training examples. At a certain point, no matter how much 
additional data you provide, the model’s out-of-sample performance won’t 
improve. The earlier the lack of improvement happens, the more biased your 
model is. When you notice bias, more data provides little benefit to your results, 
and you should change models or try to make the present model more complex. 
For instance, when using a linear model, you can rely on interactions or complete 
polynomial expansion to reduce bias. 
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TIP 


Variance problems react well when you supply more data. To correct this problem, 
determine the training set size where the out-of-sample and in-sample error 
curves converge, and check whether you can obtain enough data. When getting 
enough data isn’t viable (because, for instance, the curves are too far away), you 
have to introduce model corrections based on variable selection or hyper-parameter 
tuning. 


Python helps you easily draw learning curves using the Scikit-learn() function. 
You can also easily achieve the same result using R with custom functions, as 
described by the Revolution analytics blog at http: //blog.revolutionanalytics. 
com/2015/09/why-big-data—learning-curves .html. 


Using Cross-Validation Correctly 


REMEMBER 





REMEMBER 


Seeing a large difference between the cross-validation (CV) estimates and the 
result is a common problem that appears with a test set or fresh data. Having this 
problem means that something went wrong with the cross-validation. Beyond the 
fact that CV isn’t a good performance predictor, this problem also means that a 
misleading indicator has induced you to model the problem incorrectly and 
achieve unsatisfactory results. 


Cross-validation provides you with hints when the steps you take (data prepara- 
tion, data and feature selection, hyper-parameter fixing, or model selection) are 
correct. It’s important, but not critical, that CV estimates precisely replicate out- 
of-sample error measurements. However, it is crucial that CV estimates correctly 
reflect improvement or worsening in the test phase due to your modeling deci- 
sions. Generally, there are two reasons that the cross-validation estimates can 
vary from the true error results: 


>> Snooping: Information is leaking from the response to the model. The 
problem will also affect your test set when leaking is generalized, but it won't 
affect any new data you use out-of-sample. This problem often happens when 
you apply preprocessing on pooled training and test data. (When pooling, you 
vertically stack training and test matrices so that you can work on only a single 
matrix without having to replicate twice the same operations on data.) 


You can actually pool training and test sets together to more easily preprocess 
them. However, you have to keep in mind that you shouldn't use the test data 
for computing missing data inputs, parameter normalization, or dimensional- 
ity reduction. When you use the pooled data for any of these three purposes, 
information from the test set easily and unnoticeably leaks to your training 
process, making your work unreliable. 
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TIP 


>> Incorrect sampling: When you convert classes into data, a simple random 
sampling may prove inadequate. You should test stratified sampling, a 
statistical sampling method (https: //en.wikipedia.org/wiki/ 
Stratified_samp1ling) that assures that you draw the sample response 
classes in the same proportion as the training set. 


Python offers a stratified-k-folds CV sampler that you can read about at 
http: //scikit-learn.org/stable/modules/cross_validation. 
html#stratified-k-fold. R can stratify samples using the createFolds 
method of the caret library described at http: //www. inside-r.org/ 
packages/cran/caret/docs/createDataPartition when you provide the 
y parameter as a factor. (The caret library uses levels as strata for sampling.) 














Choosing the Right Error or Score Metric 
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TIP 


Trying to optimize an error metric based on the median error by using a learning 
algorithm based on the mean error won’t provide you with the best results unless 
you manage the optimization process in a fashion that works in favor of your 
chosen metric. When solving a problem using data and machine learning, you 
need to analyze the problem and determine the ideal metric to optimize. Examples 
can help a lot. You can get many of them from academic papers and from public 
machine learning contests that carefully define specific problems in terms of data 
and error/score metric. Look for a contest whose objective and data are similar to 
yours, and then check the requested metric. 


Contests can provide great inspiration on more than just error metrics. You can 
easily learn how to manage data applications, use machine learning tricks, and 
perform smart feature creation. For example, look at the knowledge base offered 
by websites such as Kaggle (https: //www.kaggle.com/competitions) that offer 
hundreds of available contests, all with data, best metrics, and plenty of hints 
from experts to help you learn from data in an optimal way. 


Check whether the machine learning algorithm that you want to use supports 
your chosen metric. If the algorithm uses another metric, try to influence it by 
searching for the best combination of hyper-parameters that can maximize your 
metric. You can achieve that goal by doing a grid search for the best cross-validate 
result using your own metric as a target. 
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Searching for the Best Hyper-Parameters 


TIP 


Testing 


TIP 


Most algorithms perform fairly well out of the box using the default parameter 
settings. However, you can always achieve better results by testing different 
hyper-parameters. All you have to do is to create a grid search among possible 
values that your parameters can take and evaluate the results using the right error 
or score metric. The search takes time, but it can improve your results (not drasti- 
cally, but significantly). 


When a search takes too long to complete, you can often achieve the same results 
by working on a sample of your original data. Fewer examples chosen at random 
require fewer computations, but they usually hint at the same solution. Another 
trick that can save time and effort is to do a randomized search (see Chapter 11, in 
the “Exploring the space of hyper-parameters” section, which also compares grid 
search to random search), thus limiting the number of hyper-parameter combi- 
nations to test. 


Multiple Models 


The no-free-lunch theorem should always be an inspiration for you, reminding 
you not to fall in love with certain learning approaches just because they brought 
interesting results in the past. As a good practice, test multiple models, starting 
with the basic ones — the models that have more bias than variance. You should 
always favor simple solutions over complex ones. You may discover that a simple 
solution performs better. For example, you may want to keep things simple and use 
a linear model instead of a more sophisticated, tree-based ensemble of models. 


Representing the performance of different models using the same chart is helpful 
before choosing the best one to solve your problem. You can place models used to 
predict consumer behavior, such as a response to a commercial offer, in special 
gain charts and lift charts. These charts show how your model performs by parti- 
tioning its results into deciles or smaller parts. Because you may be interested 
only in the consumers who are most likely to respond to your offer, ordering pre- 
dictions from most to least likely will emphasize how good your models are at 
predicting the most promising customers. These Quora answers help you see how 
gain and lift charts work: https: //www.quora.com/Whats-Li ft-curve and https: // 
www . quora. com/ROC—Curve—and-AUC/Whats—ROC-curve. 


Testing multiple models and introspecting them (understanding which features 
work better with them) can also provide suggestions as to which features to trans- 
form for feature creation, or which feature to leave out when you make feature 
selections. 
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Averaging Models 


REMEMBER 


Machine learning involves building many models and creating many different 
predictions, all with different expected error performances. It may surprise you to 
know that you can get even better results by averaging the models together. The 
principle is quite simple: Estimate variance is random, so by averaging many dif- 
ferent models, you can enhance the signal (the correct prediction) and rule out the 
noise that will often cancel itself (opposite errors sum to zero). 


Sometimes the results from an algorithm that performs well, mixed with the 
results from a simpler algorithm that doesn’t work as well, can create better pre- 
dictions than using a single algorithm. Don’t underestimate contributions deliv- 
ered from simpler models, such as linear models, when you average their results 
with the output from more sophisticated algorithms, such as gradient boosting. 


It’s the same principle you seek when applying ensembles of learners, such as tree 
bagging and boosting ensembles. However, this time you use the technique on 
complete and heterogeneous models that you prepare for evaluation. In this case, if 
the result needs to guess a complex target function, different models may catch 
different parts of that function. Only by averaging results output by different simple 
and complex models can you approximate a model that you can’t build otherwise. 


Stacking Models 
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TIP 


For the same reasons that averaging works, stacking can also provide you with 
better performance. In stacking, you build your machine learning models in two 
(or sometimes even more) stages. Initially this technique predicts multiple results 
using different algorithms, with all of them learning from the features present in 
your data. During the second phase, instead of providing features that a new 
model will learn, you provide that model with the predictions of the other, 
previously trained models. 


Using a two-stage approach is justified when guessing complex target functions. 
You can approximate them only by using multiple models together and then by 
combining the result of the multiplication in a smart way. You can use a simple 
logistic regression or a complex tree ensemble as a second-stage model. 


The Netflix competition provides evidence and a detailed illustration about how 
heterogeneous models can be stacked together to form more powerful models. 
However, implementing this solution as a working application can be quite 
cumbersome, as described at https: //www.techdirt.com/blog/innovation/ 
articles/20120409/03412518422/why-net fl1ix—never—implemented—algorithm- 
that-—won-net flix-1-million-challenge.shtml. 
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Applying Feature Engineering 





REMEMBER 


If you believe that bias is still affecting your model, you have little choice but to 
create new features that improve the model’s performance. Every new feature can 
make guessing the target response easier. For instance, if classes aren’t linearly 
separable, feature creation is the only way to change a situation that your machine 
learning algorithm cannot properly deal with. 


Automatic feature creation is possible using polynomial expansion or the support 
vector machines class of machine learning algorithms. Support vector machines 
can automatically look for better features in higher-dimensional feature spaces in 
a way that’s both computationally fast and memory optimal. 


However, nothing can really substitute for your expertise and understanding of 
the method needed to solve the data problem that the algorithm is trying to learn. 
You can create features based on your knowledge and ideas of how things work in 
the world. Humans are still unbeatable in doing so, and machines can’t easily 
replace them. Feature creation is more art than science, and an undoubtedly 
human art. 


Feature creation is always the best way to improve the performance of an 
algorithm — not just when bias is the problem, but also when your model is 
complex and has high variance. 


Selecting Features and Examples 


TIP 


If estimate variance is high and your algorithm is relying on many features (tree- 
based algorithms choose features they learn from), you need to prune some fea- 
tures for better results. In this context, reducing the number of features in your 
data matrix by picking those with the highest predictive value is advisable. 


When working with linear models, linear support vector machines, or neural net- 
works, regularization is always an option. Both L1 and L2 can reduce the influence 
of redundant variables or even remove them from the model (see the “Solving 
overfitting by using selection” section of Chapter 15). Stability selection leverages 
the L1 ability to exclude less useful variables. The technique resamples the train- 
ing data to confirm the exclusion. 


You can learn more about stability selection by viewing the example on the 
Scikit-learn website: http: //scikit—learn.org/stable/auto_examples/1linear_ 
model/plot_sparse_recovery.html. In addition, you can practice using the 
RandomizedLogisticRegression and RandomizedLasso Scikit-learn functions in 
the linear_model module. 
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Apart from stability selection, both R and Python offer some recursive greedy 
algorithms (see http: //ieeexplore. ieee.org/xpl/login. jsp?arnumber=1530718, 
for example) that test whether it’s worth keeping a feature in a model by trying 
the model when both present and absent. Only features that decrease model errors 
are kept in the learning process. In Python, you can check the feature_selection 
module and its recursive selector: http: //scikit-learn.org/stable/modules/ 
feature_selection.html#recursive—feature-elimination. 


R provides the same functionality in the r fe function of the caret library (http: // 
www. inside-r.org/packages/cran/caret/docs/rfe), which implements a 
recursive backward feature selection. This technique removes features from the 
model one by one, using the complete data matrix until performance starts 
decreasing (meaning that the algorithm started removing important features). 


Looking for More Data 
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After trying all the previous suggestions, you may still have a high variance of 
predictions to deal with. In this case, your only option is to increase your training 
set size. Try increasing your sample by providing new data, which could translate 
into new cases or new features. 


If you want to add more cases, just look to see whether you have similar data at 
hand. You can often find more data, but it may lack labeling, that is, the response 
variable. Spending time labeling new data or asking other people to label it for you 
may prove a great investment. Complex models can improve a lot from additional 
training examples because adding data makes parameter estimation much more 
reliable and disambiguates cases in which the machine learning algorithm can’t 
determine which rule to extract. 


If you want to add new features, locate an open source data source, if possible, to 
match your data with its entries. Another great way to obtain both new cases and 
new features is by scraping the data from the web. Often, data is available between 
different sources or through an application programming interface (API). For 
instance, Google APIs (https: //developers.google.com/apis—explorer/#p/) 
offer many geographical and business information sources. By scripting a scrap- 
ing session, you can obtain new data that can put a different perspective on your 
learning problem. New features help by offering alternative ways to separate your 
classes, which you do by making the relationship between the response and the 
predictors more linear and less ambiguous. 
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Symbols 

- operator, 113, 114 
l= operator, 115 

* operator, 113 

*= operator, 112 

** Operator, 113 
**= operator, 113 

/ operator, 113, 114 
/= operator, 113 

// operator, 113 

//= operator, 113 

% operator, 113 

%= operator, 113 

+ operator, 113, 114 
+= operator, 112 





< operator, 115 

<= operator, 115 

> operator, 115 

>= operator, 115 

= operator, 112 

-= operator, 112 

== operator, 115, 124 

~ operator, 114 

& (And) operator, 114 
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^ (Exclusive or) operator, 114 
<< (Left shift) operator, 114 





~ (One's complement) operator, 114 


| (Or) operator, 114 
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>> (Right shift) operator, 114 


A 


a posteriori, 194 
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access control, Al and machine learning for, 17-18 


accuracy measure, 367 


activation function, 280-281 

Adaboost, 324-327 

adaptation, machine learning and, 39 
Adaptive Boosting (Adaboost), 324-327 
AFINN-111 dictionary, 365 

Al. See artificial intelligence (Al) 

Al winter, 44 

Air Quality (airquality) dataset, 60 
algorithms, 23, 30-32. See also specific algorithms 
Amelia (robot), 12 

AMORE (website), 290 

Anaconda, 91-99, 109-135 
Analogizers, 29, 32 

analyzation, machine learning and, 38 
and operator, 115 

androids, 11-12 


animal protection, Al and machine 
learning for, 18 


anomalous data, delimiting, 232-236 
Antikythera mechanism (website), 12 
Apache Singa, 139 
Apache Spark MLlib, 139-140 
append( ) function, 130 
arguments, 68, 70, 120-123 
arithmetic operators, 113-114, 116 
array() function, 71-72 
arrays, 71-72, 151 
art, behind engineering, 20-21 
artificial intelligence (Al) 
about, 9-10 
compared to machine learning, 13 
history of, 12-13 
hype about, 10 
relationship with machine learning, 19 
resources on, 11 
specifications of, 20 
uses for, 15-18 
as .Date() function, 79 
as. integer() function, 82 
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assignment operators, 112-113, 116 
assumptions, checking, 241-242 
Autoencoders (website), 294 

automation, Al and machine learning for, 17 


backpropagation, 31, 286-289 
Bag of Words (BoW), 351-355 
bagging, 318, 324-327 
balanced solutions, 188-190 
Barplot() function, 87 
batch mode, 289 

Bayes, Thomas (statistician), 162 
Bayes’ theorem, 160-163, 210 
Bayesian inference, 29, 31-32 
Beautiful Soup package, 358 
Becker, Rick (programmer), 48 


Beginning Programming with Python For Dummies 
(Mueller), 110 


bias, 184-186, 196-198 

big data, 23-28 

binary classification, 201 

binary response, specifying, 267-270 
bitwise operators, 114, 116 

Blade Runner (film), 11 

Boolean values, 111, 115 

bootstrap aggregation, 318, 324-327 
bootstrapping, 193, 197, 318 

Boston dataset, 60 

boxplot, 234 

Breiman, Leo (professor), 318 
byrow argument, 68 


C 


c() function, 66, 67, 68 

Caffe library, 140, 339 

caller, 120 

Canopy Express, 92-93 

caret package, 195, 398 
centroids, 247-251 

character data types, 64 
charting data, 87 

charToRaw( ) function, 64-65 
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Cheat Sheet (website), 4 

Chebyshev distance, 240 

class( ) function, 65, 67 

classes, multiple, 270-271 

classification 
of images. See image classification 
software implementations for, 308-309 
using, 362-368 

clause, 125 

Cloudera Oryx package, 386 

clusters, locating with distances, 240-244 

code repository, 60-62, 101-106 

coding, 63-87, 109-135 

coefficient vector, 201 

collaborative filtering, 374-375, 378 

colnames( ) function, 70-71, 73 

columns, 70-74, 79 

companion files (website), 4-5 

comparison operators, 116 

complex analysis, Al and machine learning for, 17 

complex data types, 64 

complex number, 111 

Comprehensive R Archive Network (CRAN), 49, 58 

compressing data, 230-232 

conditional probability, 161, 210 

conditional statements, 124-128 

connectionism, 201 

Connectionists, 29, 31 

content-based filtering, 374-375 

continue clause, 128 

control, machine learning and, 39 

ConvNetJS package, 387 

convolutional networks, 338-339 

core memory, 177 

corpus, 351 

cosine similarity, 375 

cost functions, 173-174 

Create, Read, Update, and Delete (CRUD), 23 

cross-validation (CV), 191-196, 393-394 

CUDA-Convnet package, 386-387 

current directory, 135 

customer service, Al and machine learning for, 17 

custom-made kernels, 306 

Cutler, Adele (professor), 318 


D 


data. See also big data 
charting, 87 
comparing using Boolean expressions, 115 
indexing, 134 
looking for more, 398 
organizing using lists, 66-67 
preprocessing. See preprocessing data 
splitting, 204-208 
storing, 128-132 
types of, 64-65 
working with, 148-158 
data argument, 68, 72 
data frames, 74-80 
data stream, 275 
datasets 
downloading for Python, 99-108 
downloading for R, 59-60 
used in this book, 106-108 
dates, interacting with, 118-119 
Dean, Jeff (Google employee), 294 
Debian (website), 56 
decision trees 
about, 204 
ensembles and, 316-323 
predicting outcomes by splitting data, 204-208 
pruning, 208-209 
decision-making, 80-82 
Deep Belief Networks (website), 294 
deep learning, 44, 293-295 
Deep Mind, 169 
default directories, 135 
default values, 122 
delimiting anomalous data, 232-236 
Descartes, René (philosopher), 12 
deterministic actions, 208 
Dick, Philip K. (author), 11 
dictionaries, 130, 134 
difference() function, 129 
dim argument, 72 
dimnames( ) function, 68, 72, 73 
DisplayMulti() function, 127 
DisplaySum() function, 122 


distances 
computing for learning, 239-240 
locating clusters with, 240-244 
distributions, 164, 225-227 
Domingos, Pedro (scientist), 29 
DoSum() function, 121 
draw with repetition, 193 
drop-out, 295 
dual formulation, 305 


e1071 package, 387 
Eclipse, 143 
Eigenfaces, 340-343 
Einstein, Albert (scientist), 188 
elements, of matrices, 69-70 
elif clause, 125 
else clause, 125 
email spam detection, 213 
engineering, 20-21, 29 
enrichment, machine learning and, 38 
ensembles 
about, 315-316 
almost random guesses, 324-327 
boosting smart predictors, 327-329 
decision trees and, 316-323 
predictors, 329-330 
Enthought Canopy Express, 92-93 
environment variables, 135 
equality operators, 116, 124 
error function, 173-174 
error metric, 394 
escape character, 129 
Euclidean distance, 239-240, 242, 249, 375 
evaluation, training and, 33 
Evolutionaries, 29, 31 
example code, 59-62, 99-108 
examples, selecting, 397-398 
expanding data frames, 79-80 
expectations, checking, 241-242 
exploratory data analysis (EDA), 233 
Exploratory Data Analysis (Tukey), 233 
exponentiation, 116 
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extend() function, 130 
Extract, Transform, Load (ETL) model, 143 


F 


faces, recognizing with eigenfaces, 340-343 
factor ( ) function, 74 
factors, 74-76 
features 
creating, 227-229, 397 
defined, 149 
outcome of, 271-272 
selecting, 397-398 
transforming, 298 
feed-forward input, 281-283 
files, 61-62 
Fisher, Ronald (statistician), 245, 283 
fixed size trees, 329 
float() command, 117 
floating point value, 110 
FNN (website), 253 
for loop, 84, 126-127 
frameworks, reducing workload with, 139-140 
fraud detection, Al and machine learning for, 16-17 
Freund, Yoav (Adaboost developer), 324 
from. ..import statement, 134 
function() function, 85 
functions. See also specific functions 
about, 119 
calling, 121-123 
cost, 173-174 
reusable, 119-120 
working with, 85 


G 


gain and lift charts, 395 

Gaussian filter, 336 

gbm package (website), 329, 388 
generalization, 183-184 

Gensim package, 388 

Geographical Information Systems (GIS), 26 
gini impurity, 205, 322 

glmnet package, 388-389 

global variables, 123-124 


402 Machine Learning For Dummies 


Google, 28 

Google APIs, 398 

Google Brain project, 294 
Google TensorFlow, 140 


gradient boosting machines (GBM) algorithm, 
327-329 


gradient descent, 174-176, 275, 328-329 
gradient descent optimization, 260 
Graphical Processing Units (GPUs), 15 
ground truth, 219 


H 


hardware, role of updated, 39 
Hastie, Trevor (professor), 273 
health care, machine learning in, 37 
heuristics, 242 

Hinton, Geoffrey (professor), 44, 293 
Hobbes, Thomas (philosopher), 12 
Homebrew, 58 

humanoid robots, 11 
hyper-parameters, 171, 194, 195-196, 308-309, 395 
hyperplane, 259 

hypothesis, 10 

hypothesis space, 171 


icons, explained, 3-4 
identity operators, 117, 118 
if statement, 81, 124-125 
if...else statement, 81 
image classification 
about, 333-334 
extracting visual features, 338-340 
process of, 343-347 
recognizing faces with Eigenfaces, 340-343 
working with sets of images, 334-338 
image recall, 338 
immutable, 131 
import datetime command, 118 
import statement, 134-135 
imshow( ) function, 335 
in operator, 118 
incorrect sampling, 394 


index, 150 
Induction of Decision Trees (Quinlan), 206 
industrial settings, machine learning in, 38-39 
information gain, 205 
informative entropy, 205 
input() function, 126 
input layer, 282 
installing 
Python on Linux, 93-94 
Python on Mac OS X, 94-96 
Python on Windows, 96-99 
R on Linux, 56-57 
R on Mac OS X, 57-59 
R on Windows, 49-55 
install .packages() function, 86 


Institute of Electrical and Electronics 
Engineers (IEEE), 20 


int() function, 117, 126 
integer, 110 
integer data types, 64 


International Organization for Standardization/ 
International Electrotechnical 
Commission (ISO/IEC), 20 


InterQuartile Range (IQR), 233-234 
intersection() function, 129 
Iris dataset, 60 

iRobot Roomba, 37 

is not operator, 118 

is operator, 118 

iteration, 132-133, 275 

ITK package (website), 334 


J 


Japanese robots (website), 12 
Jupyter Notebook, 100-101 


K 


Kaggle (website), 394 

kernel trick, 305-306 

keyword, sending arguments by, 121-122 
k-folds, 192-193 

K-means algorithm, 241-251 


K-Nearest Neighbors (KNN) algorithm, 199, 251, 298 


Knime, 143 


knowledge based recommendations, 374-375 
knowledge representation, 19 
k-parameter, 252-256 


L 


L1 regularization, 272-273 
L2 regularization, 272-274 
Labeled Faces in the Wild dataset, 343 
labeling, 338 
Laplace correction, 211 
Lasagne (website), 290 
Lasso, 272-273 
law of large numbers, 316 
leakage traps, 196-198 
learning, process of, 170-172 
learning curves, 189-190, 392-393 
learning rate, 202 
leave-one-out cross-validation (LOOCV), 193 
Leibniz, Gottfried (philosopher), 12 
length( ) function, 158 
levels( ) function, 74 
library( ) function, 86 
LIBSVM, 141-142 
linear kernel, 306 
linear models 
about, 257-258 
combining variables, 258-264 
features, 271-274 
mixing variables, 264-267 
one-example-at-a-time approach, 274-278 
probabilities, 267-271 
link function, 269 
Linux, 56-57, 93-94 
list() function, 66 
lists, 66-67, 129-131 
Literary Digest, 183 
local variables, 123-124 
Loew, Judah (Rabbi), 12 
logical data types, 64, 110-117 
logical operators, 115-116, 117 
logistic regression, 298 
loop statements, 124-128 
loops, 82-84 
loss function, 173-174 
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M 


Mac OS X, 57-59, 94-96 
machine efficiency, Al and machine learning for, 17 
machine learning. See also specific topics 
about, 10, 167-168 
compared to Al, 13 
compared to statistics, 14 
cost functions, 173-174 
goals of, 13-14 
gradient descent, 174-176 
in health care, 37 
history of, 12-13 
in industrial settings, 38-39 
limits of, 14-15 
new work opportunities with, 39-43 
process of learning, 170-172 
reinforcement learning, 169 
relationship with Al, 19 
role of in robots, 36-37 
role of statistics in, 29 
specifications of, 20 
supervised learning, 168 
unsupervised learning, 169 
updating, 177-180 
uses for, 15-18 
validating, 181-198 
machines, how they read, 350-361 
Macready, William (mathematician), 188 
Mahotas (website), 334 
Manhattan distance, 240 
mapping, 171 
MapReduce, 144 
margin, 300 
math 
about, 147-148 
basic operations, 152 
creating matrices, 150-152 
probabilities, 158-163 
statistics, 163-165 
of SVMs, 301-302 
vectorization, 155-158 
working with data, 148-158 
matnames( ) function, 73 
matrices, 67-70, 150-152 
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matrix() function, 68-69, 70 
matrix inversion, 155 
maximum margin, 300 
mean, 85-87, 164 
mean decrease accuracy, 322 
mean decrease impurity, 322 
mean( ) function, 85-87 
meaningful data, 219 
median, 85-87, 164 
Median filter, 336 
median() function, 85-87 
member operators, 117 
membership operators, 117, 118 
Milborrow, Stephen (author), 207 
mini-batch mode, 179, 289 
min-max transformation, 227 
missing data, 221-225 
mlv() function, 86-87 
models 

averaging, 396 

complexity of, 186-187 

stacking, 396 

testing multiple, 395 

tips for improving, 391-398 
modules, storing code in, 134-135 
MovieLens, 371-373, 378-382 
movies, recommended, 369-382 
MSWeb dataset, 373-374 
Mueller, John (author) 
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For Dummies, 110 


multiple linear regression, 259 
multiplying matrices, 152-155, 201 
mutually exclusive, 159 


N 


Naive Bayes algorithm, 161, 209-215 

named entity recognition, 350 

Natural Language Processing (NLP), 19, 349-350 
Natural Language Toolkit (NLTK), 355 

natural language understanding, 19 

ncol argument, 68 

Nervana Neon, 140 

nesting, 125-126 


Netflix, 396 
network parallelism, 179 
neural networks 
about, 279-280 
backpropagation, 286-289 
deep learning, 293-295 
neurons, 280-289 
overfitting, 289-293 
neuralnet (website), 290 
neurons, 280-289 
Ng, Andrew (Google employee), 294 
Ngram Viewer, 358 
no-free-lunch theorem, 188 
Nolearn (website), 290 
nonlinearity, applying, 303-308 
nonlinearly separable, 202-204 


Non-Negative Matrix Factorization (NMF) 
decomposition class, 362 








nonnegativity, 239 
nonresponsive bias, 183 
nonseparability, pitfalls of, 302-303 
normal distribution, 226 
normalization, 227 

not in operator, 118 

not operator, 116 

notebooks, 103-106 

nrow argument, 68 

numbers, working with, 110-117 
numeric data types, 64 


NumPy package, 156, 157, 223, 227, 
334, 364, 385, 389 


O 


objective function, 173-174 
Occam's razor, 188 

On The Web icon, 4 

One Versus All (OVA), 201 

One Versus One (OVO), 270-271 
One Versus Rest (OVR), 270 
one-example-at-a-time approach, 274-278 
one-hot encoding, 264 

online learning, 179 

online mode, 288-289 

OpenCV package (website), 334, 339 
OpenSUSE (website), 56 


operations, 152, 159-160 
operator precedence, 116 
operators. See specific operators 
opinions and sentiments 
about, 349 
how machines read, 350-361 
Natural Language Processing (NLP), 349-350 
using scoring and classification, 362-368 
optimal margin hyperplane, 300 
optimization 
machine learning and, 39 
training and, 33 
or operator, 116 
ordinary least squares (OLS), 260 
outcomes, predicting, 204-208 
outliers, 232-233, 265 
out-of-core learning, 275 
out-of-sample errors, checking, 182-184 
output layer, 282 
overestimation, 191 
overfitting 
about, 187, 289-290 
building neural networks, 290-293 
solving, 272-274 
Oxdata H20, 140 


P 


packages, recommended, 385-390. See also specific 
packages 


pasting, 318 
Pep-8 style guide, 111 
perceptron, 200-204 
placeholder, 129 
planning, 19 
polynomial expansion, 266-267 
polynomial kernel, 306 
pooling layers, 339 
population, 164 
positional arguments, 121 
predicting wait times, Al and machine learning for, 18 
predictors, 259, 329-330 
preprocessing data 
about, 219-220 
compressing data, 230-232 
creating features, 227-229 
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preprocessing data (continued) sets, 128-129 


delimiting anomalous data, 232-236 strings, 117-118 
gathering and cleaning data, 220-221 style concerns with, 111-112 
repairing missing data, 221-225 tuples, 131-132 
transforming distributions, 225-227 unary operators, 114, 116 
primal formulation, 305 versions, 90, 110 
Principal Components Analysis (PCA), 231, pythonxy Integrated Development 
242, 253, 362 Environment (IDE), 93 
print() function, 66, 67, 68-69, 73, 122, 
123-124, 125 
printStr() function, 123-124 Q 
private sources, obtaining data from, 26-27 qualitative features, 149, 222 
probability quantitative features, 149, 222 
about, 158-159 queues, 130 
Bayes’ theorem, 160-163 Quinlan, John Ross (computer scientist), 149, 205, 206 


multiple classes, 270-271 

operations and, 159-160 R 

specifying binary response, 267-270 
processors, role of updated, 39 
products, recommended, 369-382 
pruning decision trees, 208-209 
public sources, obtaining data from, 26 
Python 

about, 18, 89 

arithmetic operators, 113-114, 116 

assignment operators, 112-113, 116 

bitwise operators, 114, 116 

choosing, 90-93 

coding in using Anaconda, 109-135 

conditional and loop statements, 124-128 

dates, 118-119 

dictionaries, 134 

functions, 119-124 

identity operators, 117, 118 

installing on Linux, 93-94 

installing on Mac OS X, 94-96 

installing on Windows, 96-99 

iterators, 132-133 

lists, 129-131 

logical operators, 115-116, 117 

membership operators, 117, 118 

modules, 134-135 

numbers and logic, 110-117 


R 
about, 18, 47-48 
arrays, 71-74 
basic data types, 64-65 
choosing, 48-49 
coding in, 63-87 
creating data frames, 74-80 
downloading datasets, 59-60 
downloading example code, 59, 60-62 
installing on Linux, 56-57 
installing on Mac OS X, 57-59 
installing on Windows, 49-55 
matrices, 67-71 
organizing data using lists, 66-67 
resources on, 65 
statistical tasks, 80-87 
vectors, 66 
Radial Basis Function (RBF) kernel, 306-307 
Random Forests, 318-323 
random noise component, 230 
random sampling, 164, 178 
randomForest package, 389 
RapidMiner, 143 
Rattle, 49 
raw text, 360-361 
rbind() function, 80 
recognizing faces, 340-343 
recommender systems, 370-371 


operator precedence, 116 
relational operators, 115 
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Red Hat (website), 56 

Red-R, 49 

regression result, 269 
regularization, 272-274 
reinforcement learning, 169 
relational operators, 115 
Remember icon, 4 

remove() function, 130 
repeat loop, 82-83 

repetitive tasks, 126-127 
replacement strategy, 222-225 
representation, training and, 33 
representation capability, 286 
reproducibility, 242 

required arguments, sending, 121 


resource scheduling, Al and machine learning for, 17 


Restricted Boltzanman Machines (website), 294 
return statement, 121 
reusable functions, 119-120 
Revolution analytics blog (website), 393 
Richardson, Leonard (developer), 358 
Ridge, 272-274 
robotics, 19 
robots, role of machine learning in, 36-37 
Rocket Fuel, 176-177 
Roomba (iRobot), 37 
Rosenblatt, Frank (psychologist), 201-202 
Rosie the robot, 36 
rownames( ) function, 70-71, 73 
rows, 70-74, 79-80 
RStudio 

about, 49 

coding in R using, 63-87 

installing on Linux, 56-57 

installing on Mac OS X, 58-59 

installing on Windows, 49-55 


S 


safety systems, Al and machine learning for, 17 
sample bias, 196-198 

SAS, 139-140 

SayHello() function, 122 

scatterplot, 235-236 

Schapire, Robert (Adaboost developer), 324 
science, 20, 29 


Scikit-learn package, 190, 195, 215, 244, 248, 313, 
340, 397 


SciPy package, 389-390 
scipy.ndimage package (website), 334 
score metric, 394 
scoring, 362-368 
scoring function, 173-174 
screenshots, 55, 99 
SecretNumber () function, 126, 127-128 
sentiment analysis, 213 
sentiments. See opinions and sentiments 
Set class, 128 
sets, 128-129 
shared variance, 230 
Shogun, 140 
show( ) function, 335 
shrinkage, 328 
sigmoid kernel, 306 
signal, 396 
similarity 
about, 237-238 
k parameter, 252-256 
K-means algorithm, 244-251 
K-Nearest Neighbors (KNN) algorithm, 251 
locating clusters using distances, 240-244 
measuring between vectors, 238-240 
simple linear regression, 259 


Singular Value Decomposition (SVD), 231, 362, 
375-382 


Siri, 9, 38 

The Six Million Dollar Man (TV series), 16 
SMS Spam Collection (sms_data) dataset, 60 
snooping, 198, 393 

Spark, 144 

SPSS, 139-140 

stacking models, 396 

stacks, 130 

standard deviation, 164 

Stata, 139-140 

StatET, 49 


statistical standardization (z-score 
normalization), 226 


statistical tasks 
about, 80 
charting data, 87 
finding mean and median, 85-87 
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statistical tasks (continued) 
functions, 85 
loops, 82-84 
making decisions, 80-82 
performing looped tasks without loops, 84 
statistics, 14, 29, 163-165 
Statistics vs. Machine Learning, fight! (website), 29 
stemming, 355-357 
stochastic gradient, 180 
stochastic gradient descent (SGD), 275-278 
stochastic mode, 289 
stop words, removing, 355-357 
stopping rules, 206 
str() function, 77-78, 117, 122 
stratified sampling, 165, 178 
streaming, 179, 275 
strings, creating and using, 117-118 
Structured Query Language (SQL), 41 
subsampling, 178, 328 
summary () function, 78 
supervised learning, 168 
support vector machines (SVMs) 
about, 297-298 
algorithm, 299-303 
applying nonlinearity, 303-308 
classifying and estimating with, 309-313 
hyper-parameters, 308-309 
separation problem, 298-299 
support vectors, 300-301 
survivorship bias, 184 
switch statement, 82 
symbolic reasoning, 30-31 
Symbolists, 29, 30-31 
symmetry, 239 


T 


table() function, 75 
tag soup, 358 
Talos (mechanical man), 12 
target function, 170 
Technical Stuff icon, 3 
techniques 

about, 199-200 

decision trees, 204-209 
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Naive Bayes, 209-215 

perceptron, 200-204 
technologies 

about, 35-36 

creating for the future, 36-39 


new work opportunities with machine learning, 
39-43 


potential pitfalls of future, 43-44 
TensorFlow (website), 290 


term frequency-inverse document 
frequency (TF-IDF) transformation, 
353-354, 363 


testing, 191 
text, 352-357, 360-361 
text-classification, 213 
text-processing tasks, 213 
textual datasets, from the Web, 357-360 
Theano (website), 290 
theory, 10 
time() command, 119 
Tip icon, 3 
Titanic dataset, 60 
token, 351-352 
tools. See also Python; R 
about, 137-138 
Knime, 143 
LIBSVM, 141-142 
RapidMiner, 143 


reducing workload with frameworks and, 
139-140 


SAS, 138-139 

Spark, 144 

SPSS, 138-139 

Stata, 138-139 

Vowpal Wabbit, 142 

Weka, 140-141 
Total Variation Denoising filter, 336 
training, 32-33, 191 
tree-based features selection, 322 
triangle inequality, 239 
tribes, 29 
Tukey, John (author) 

Exploratory Data Analysis, 233 
tuples, 129, 131-132 
Turing, Alan (author), 12 


U 


Ubuntu (website), 56 

unary operators, 114, 116 
uncorrelated, 318 
underestimation, 191 
understandability, 242 
unigrams, 355 

union() function, 129 
unique variance, 230 

units. See neurons 

universal approximators, 286 


Universal Transformation Format 8-bit (UTF-8), 
360-361 


unsupervised learning, 169 
updates, book (website), 4 
updating, 177-180 

upper () function, 127 
usability, 242 


V 


validating 
about, 181-182, 191 
alternatives in, 193-194 
avoiding sample bias and leakage traps, 196-198 
balanced solutions, 188-190 
bias, 184-186 
cross-validation, 191-193 
model complexity, 186-187 
optimizing cross-validation choices, 194-196 
out-of-sample errors, 182-184 
testing, 191 
training, 191 
values( ) function, 134 
Vapnik, Vladimir (mathematician), 299 
variable assignments, performing, 112-113 
variables 
combining, 258-264 
defined, 149 
global, 123-124 
local, 123-124 
mixing, 264-267 
variance reduction, 205 
vector of coefficients, 151 
vectorization, 155-158 


vectors 
about, 64 
changing arrangement of, 69 
defined, 151 
measuring similarity between, 238-240 
working with, 66 
Versicolor, 249 
versions, Python, 90, 110 
visual features, extracting, 338-340 
Vowpal Wabbit, 142 


W 


Warning icon, 3 
Web scraping, 357-360 
websites 
Air Quality (airquality) dataset, 60 
Amelia (robot), 12 
AMORE, 290 
Anaconda, 91 
Antikythera mechanism, 12 
Autoencoders, 294 
book updates, 4 
Boston dataset, 60 
caret package, 195, 398 
Cheat Sheet, 4 
Cloudera Oryx package, 386 
companion files, 4-5 
Comprehensive R Archive Network (CRAN), 49 
ConvNetJS package, 387 
CUDA-Convnet package, 386-387 
Debian, 56 
Deep Belief Networks, 294 
Deep Learning, 44 
dictionaries, 365 
dual formulation, 305 
e1071 package, 387 
Eclipse, 143 
encodings, 360 
Enthought Canopy Express, 92-93 
FNN, 253 
gain and lift charts, 395 
gbm package, 329, 388 
Gensim package, 388 
glmnet package, 388-389 
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websites (continued) 
Google APIs, 398 
Graphical Processing Units (GPUs), 15 
Homebrew, 58 
image classification, 333 
Iris dataset, 60 
iRobot Roomba, 37 
ITK package, 334 
Japanese robots, 12 
Kaggle, 394 
Knime, 143 
Labeled Faces in the Wild dataset, 343 
languages, 20 
Lasagne, 290 
law of large numbers, 316 
LIBSVM, 141 
Mahotas, 334 
MapReduce, 144 
MovieLens, 371-373 
Natural Language Toolkit (NLTK), 355 
neuralnet, 290 
no-free-lunch theorem, 188 
Nolearn, 290 
OpenCV package, 334, 339 
OpenSUSE, 56 
Pep-8 style guide, 111 
Python, 18 
Python issues, 91 





pythonxy Integrated Development Environment 


(IDE), 93 
R language, 48, 65 
Random Forests, 320 
randomForest package, 389 
RapidMiner, 143 
Rattle, 49 
recommender systems, 370-371 
Red Hat, 56 
Red-R, 49 
Restricted Boltzanman Machines, 294 
Revolution analytics blog, 393 
Rocket Fuel, 176-177 
RStudio, 49 
SAS, 139 
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Scikit-learn package, 190, 195, 215, 244, 248, 
313, 340, 397 


SciPy package, 389-390 
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